ABSTRACT

An attempt at defining and describing those factors
which most often jeopardize the validity of naturalistic behavioral
data is presented. A number of investigations from many laboratories
which demonstrate these methodological problems are reviewed. Next,
various solutions to these dilemmas of methodology are suggested,
implemented, and tested for effectiveness. The research described
in the paper involves the observation of both "normal" and "deviant"
children and families in the home setting. The observation system
employed is a modified form of the code devised by Patterson, Ray,
Shaw, and Cobb (1969). The observations are made under certain
restrictive conditions: (1) All family members must be present in two
adjoining rooms; (2) No interactions with the observer are permitted;
(3) The television set may not be on; and (4) No visitors or extended
telephone calls are permitted. Other later studies are also reviewed
in this paper. (CK)
METHODOLOGICAL ISSUES IN NATURALISTIC OBSERVATION:
SOME PROBLEMS AND SOLUTIONS FOR FIELD RESEARCH
Stephen M. Johnson and Orin D. Bolstad
University of Oregon

Encapsulated schools of thought have occurred in all sciences at
some stage in their development. They appear most frequently during
periods where the fundamental assumptions of the science are in question.
Manifesto papers, acrimonious controversy, mutual rejection, and
isolation of other schools' strategies are hallmarks of such episodes [David
L. Krantz, The separate worlds of operant and non-operant psychology.
Journal of Applied Behavior Analysis, 1971, 4 (1), p. 61].

History may well reveal that the greatest contribution of behavior
modification to the treatment of human problems came with its emphasis on
the collection of behavioral data in natural settings. The growth of the
field will surely continue to produce greater refinement and proliferation
of specific behavior change procedures, but the critical standard for
assessing their utility will very likely remain the same. We will always
want to know how a given procedure affects the subject's relevant behavior
in his "real" world.

If a behaviorist wants to convince someone of the correctness of his
approach to treating human problems, he is generally much less likely to
rely on logic, authority, or personal testimonials to persuade than are
proponents of other schools of psychotherapeutic thought. Rather, it is
most likely that he will show his behavioral data with the intimation that this
data speaks eloquently for itself. Because he is aware of the research on
the low level of generalizability of behavior across settings (e.g., see
Mischel, 1968), he is likely to be more confident in this data as it
becomes more naturalistic in character (i.e., as it reflects naturally 
occurring behavior in the subject's usual habitat). As a perusal of the 
behavior modification literature will indicate, these data are often 
extremely persuasive. Yet, the apparent success of behavior modification 
and the enthusiasm that this success breeds may cause all of us to take 
an uncritical approach in evaluating the quality of that data on which the 
claims of success are based. A critical review of the naturalistic data 
in behavior modification research will reveal that most of it is gathered 
under circumstances in which a host of confounding influences can oper- 
ate to yield invalid results. The observers employed are usually aware 
of the nature, purpose and expected results of the observation. The
observed are also usually aware of being watched and often they also know
the purpose and expected outcome of the observation. The procedures for
gathering and computing data on observer agreement or accuracy are
inappropriate or irrelevant to the purposes of the investigation. There is almost
never an indication of the reliability of the dependent variable under study, 
and rarely is there any systematic data on the convergent validity of the 
dependent measure(s). Thus, by the standards employed in some other areas of 
psychological research, it can be charged that much behavior modification 
research data is subject to observer bias, observee reactivity, fakability, 
demand characteristics, response sets, and decay in instrumentation. In 
addition, the accuracy, reliability and validity of the data used is often 
unknown or inadequately established.

But, the purpose of this paper is not to catalogue our mistakes or to 
argue for the rejection of all but the purest data. If that were the case, 
we would probably have to conclude with that depressing note which makes 
so many treatises on methodology so discouraging. Although dressed in more 
technical language, this purist view often expresses itself as: "You 
can't get there from here." We can get there, but it's not quite as
simple as perhaps we were first led to believe. The first step in getting
there is to define and describe those factors which most often jeopardize
the validity of naturalistic behavioral data. To this end, we will review
a host of investigations from many laboratories which demonstrate these
methodological problems. The second step is more constructive in nature:
to suggest, implement, and test the effectiveness of various solutions to
these dilemmas of methodology. Because behavioral data has become the
primary basis for our approach to diagnosing and treating human problems,
the endeavor to improve methodology is perhaps our most critical task for
strengthening our contribution to the science of human behavior.

We will argue that the same kinds of methodological considerations 
which are relevant in other areas of psychology are equally pertinent for 
behavioral research. At least with respect to the requirements of sound 
methodology, the time of isolation of behavioral psychology from other
areas of the discipline should quickly come to an end. 

Throughout this paper, we will rely heavily on the experience of our 
own research group in meeting, or at least attenuating, these problems. 
We take this approach to illustrate the problems and their possible solu- 
tions more precisely and concretely. Most of our solutions are far from 
perfect or final, but it is our hope that a report based on real experi- 
ence and data may be more meaningful than hypothetical solutions which 
remain untested. Thus, before beginning the outline of methodological
problems and their respective solutions, it will be necessary for the
reader to have a general understanding of the purposes and procedures of
our research. This research involves the observation of both "normal"
and "deviant" children and families in the home setting. The observation 
system employed is a modified form of the code devised by Patterson, Ray, 
Shaw, and Cobb (1969). This revised system utilizes 35 distinct behavior
categories to record all of the behaviors of the target child and all behaviors
of other family members as they interact with this child. The system is
designed for rapid sequential recording of the child's behavior, the
responses of family members, the child's ensuing response, etc. Observations
are typically done for forty-five minutes per evening during the pre-dinner
hour for five consecutive week nights. The observations are made under
certain restrictive conditions: a) All family members must be present in
two adjoining rooms; b) No interactions with the observer are permitted;
c) The television set may not be on; and d) No visitors or extended
telephone calls are permitted. Obviously, this represents a modified
naturalistic situation.

On the average, these procedures yield the recording of between 1,800
and 1,900 responses and an approximately equal number of responses of other
family agents over this time period of 3 hours and 45 minutes. This data is
collected in connection with a number of interrelated projects. These include
normative research investigations of the "normal" child (e.g., Johnson, Wahl,
Martin & Johansson, 1972); research involving a behavioral analysis of the
child and his family (e.g., Wahl, Johnson, Martin & Johansson, 1972;
Karpowitz, 1972; Johansson, Johnson, Martin, & Wahl, 1971); outcome research on
the effects of behavior modification intervention in families (Eyberg,
1972); comparisons of "normal" and "deviant" child populations (Lobitz &
Johnson, 1972); and studies of methodological problems (Johnson & Lobitz, 1972;
Adkins & Johnson, 1972; Martin, 1971). These latter studies will be
reviewed in detail in the body of this paper. More recently, we have begun
to investigate the generality of children's behavior across school and home
settings, and to document the level of generalization of the effects of
behavior modification in one setting to behavior in other settings (Walker,
Johnson, & Hops, 1972). Research is also in progress to relate naturalistic
behavioral data to parental attitudes and behavioral data obtained in more
artificial laboratory settings. With all of these objectives in mind, it
is most critical that the behavioral data collected is as valid as
possible and it is to this end that we explore the complex problems of
methodology presented here.

Observer Agreement and Accuracy I: 
Problems of Calculation and Inference 
The most widely recognized requirement of research involving behavioral
observations is the establishment of the accuracy of the observers. This is
typically done by some form of calculation of agreement between two or more
observers in the field. Occasionally, observers are tested for accuracy by
comparing their coding of video or audio tape with some previously established
criterion coding of the recorded behavior. For convenience, we will refer to
the former procedure as calculation of observer agreement and the latter as
calculation of observer accuracy. In general, both of these procedures have
been labeled observer reliability. We will eschew this terminology because it
tends to confuse this simple requirement for observer agreement or accuracy
with the concept of the reliability of a test as understood in traditional
test theory. As we shall outline in section three, it is quite possible
to have perfect observer agreement or accuracy on a given behavioral
score with absolutely no reliability or consistency of measurement in the
traditional sense. Generally, the classic reliability requirement involves
a demand for consistency in the measurement instrument over time (e.g.,
test-retest reliability) or over sampled item sets responded to at roughly
the same time (e.g., split-half reliability). An example may help clarify
this point. If two computers score the same MMPI protocol identically,
there is perfect "observer agreement" but this in no way means that the
MMPI is a reliable test which yields consistent scores. Although the
question of reliability as traditionally understood has been largely ignored
in behavioral research, we will argue in section three that it is a critical
methodological requirement which should be clearly distinguished from
observer agreement and accuracy.

There is no one established way to assess observer agreement or 
accuracy and that is as it should be, because the index must be tailored 
to suit the purposes of each individual investigation. There are three 
basic decisions which must be made in calculating observer agreement. The 
first decision involves the stipulation of the unit score on which the index 
of agreement should be assessed. In other words, what is the dependent 
variable for which an index of accuracy is required as measured by agree- 
ment with other observers or with a criterion? An example from our own 
research may help clarify this point. We obtain a "total deviant behavior
score" for each of the children we observe. This score is based on the sum
output of 15 behaviors judged to be deviant in nature. An outline of the
rationale and validity of this score will be given in a later section.
Suffice it to say, whenever two observers watch the same child for a given
period, they each come up with their own deviant behavior score. These
scores may then be compared for agreement on overall frequency. It is
obvious that the same deviant behaviors need not be observed to get high
indexes of agreement on the total number of deviant behaviors observed.
For many of our purposes, this is not important, since we merely want
an index of the overall output of deviant behavior over a given period.
The same procedure is, of course, applicable to one behavior only, chains
of behavior, etc. The point is that the researcher must decide what unit
is of interest to him for his purposes and then compare agreement data on
that variable. In complex coding systems, like the one used in our
laboratory, it has been customary to get an overall percent agreement figure
which reflects the average level of agreement within small time blocks
(e.g., 6-10 seconds) over all codes. In general, we would argue that this
kind of observer agreement data is relatively meaningless. It has limited
meaning because it is based on a combination of codes, some of which are
observed with high consensus and some which are not. Furthermore, the figure
tends to overweight those high rate behaviors which are usually observed
with greater accuracy and underweight those low frequency behaviors which
are usually observed with less accuracy. Patterson (personal communication)
has reported that the observer agreement on a code correlates .49 with
its frequency of use. Since it is often the low base rate behaviors which
are of most interest to researchers, this overall index of observer
agreement probably overestimates the actual agreement on those variables of
most concern.
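
As an illustration only, the weighting problem can be sketched in a few lines of Python. The behavior code names and tallies below are hypothetical and are not data from the research described here; the sketch simply contrasts a pooled percent agreement figure with the figures obtained for each code separately.

```python
# Hypothetical agreement tallies per behavior code: (agreements, disagreements).
# Code names and counts are invented for illustration only.
tallies = {
    "talk": (900, 60),        # high-rate code, coded with high consensus
    "attend": (700, 70),      # high-rate code, coded with high consensus
    "tease": (12, 12),        # low-rate code, coded with less consensus
    "destructive": (5, 7),    # low-rate code, coded with less consensus
}


def percent_agreement(agreements, disagreements):
    """Return agreements / (agreements + disagreements)."""
    return agreements / (agreements + disagreements)


# Overall figure pooled over all codes, as traditionally reported.
total_agree = sum(a for a, d in tallies.values())
total_disagree = sum(d for a, d in tallies.values())
overall = percent_agreement(total_agree, total_disagree)

# Agreement computed separately for each code.
per_code = {code: percent_agreement(a, d) for code, (a, d) in tallies.items()}

print(f"overall: {overall:.2f}")
for code, value in per_code.items():
    print(f"{code}: {value:.2f}")
```

With these invented tallies the pooled figure is dominated by the two high-rate codes, so it stays above 90% even though the two low-rate codes are coded at 50% agreement or worse.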

The second question to be faced involves the time span within which
common coding is to be counted as an agreement. For most purposes of our
current research, score agreement over the entire 225 minutes of
observation is adequate. Thus, when we compute the total deviant behavior score
over this period, we do not know that each observer sees the same deviant
behavior at the same time. But, good agreement on the overall score tells
us that we have a consensually validated estimate of the child's overall
deviancy. For some research purposes, this broad time span for agreement
would be totally inadequate. For conditional probability analysis of one
behavior (cf. Patterson & Cobb, 1971), for example, one needs to know
that two observers saw the same behavior at the same time and (depending
on the question) that each observer also saw the same set or chain of
antecedents and/or consequences. This latter criterion is extremely
stringent, particularly with complex codes where low rate behaviors are
involved, but these criteria are necessary for an appropriate accuracy
estimate.

Once one has decided on the score to be analyzed and the temporal
rules for obtaining this score, one must then face the problem of what to do
with these scores to give a numerical index of agreement. The two most
common methods of analysis are percent agreement and some form of
correlational analysis over the two sets of values. Both methods may, of course,
be used for observer agreement calculation within one subject or across a
group of subjects. Once again, neither method is always appropriate for
every problem and each has its advantages and disadvantages. The most
common way of calculating observer agreement involves the following simple
formula:

$$\text{percent agreement} = \frac{\text{number of agreements}}{\text{number of agreements} + \text{number of disagreements}}$$

What is defined as an agreement or disagreement has already been solved if
one has decided on the "score" to be calibrated and the time span involved.

Use of this formula implies, however, that one must be able to
discriminate the occurrence of both agreements and disagreements. This can
only be accomplished precisely when the time span covered is relatively
small (e.g., 1-15 seconds) so that one can be reasonably sure that two
observers agreed or disagreed on the same coding unit. It has been common
practice for investigators to compare recorded occurrences of behavior
units over much longer time periods and obtain a percent agreement figure
between two observers which reflects the following:

$$\frac{\text{smaller number of observed occurrences}}{\text{larger number of observed occurrences}}$$

The present authors would view this as an inappropriate procedure because
there is no necessary "agreement" implied by the resulting percent. If
one observer sees 10 occurrences of a behavior over a 30-minute period and
the other sees 12, there is no assurance that they were ever in agreement.
The behavior could have occurred 22 or more times and there could be
absolutely no agreement on specific events. The two observers did not necessarily
agree 83% of the time. Data of this kind can be more appropriately analyzed
by correlational methods if such analysis is consistent with the way in
which the data is employed for the question under study. Although the same
basic problem mentioned above can, of course, occur, the correlational
method is viewed as more appropriate because: a) The correlation is computed
over an array of subjects or observation time segments and b) The correlation
reflects the level of agreement on the total obtained score and it does not
imply any agreement on specific events.
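
The 10-versus-12 example can be worked through in a short Python sketch. The time-block indexes are hypothetical, and the two function names are ours for illustration; the point is only that the smaller/larger ratio can be high while block-by-block agreement is zero.

```python
def count_ratio(events_a, events_b):
    """Smaller number of recorded occurrences over the larger number."""
    smaller, larger = sorted([len(events_a), len(events_b)])
    return smaller / larger


def event_agreement(events_a, events_b):
    """Agreements / (agreements + disagreements), scored block by block.

    Each argument is the set of time-block indexes in which that observer
    recorded the behavior. A block is an agreement when both observers
    recorded the behavior in it and a disagreement when only one did.
    """
    agreements = len(events_a & events_b)
    disagreements = len(events_a ^ events_b)
    return agreements / (agreements + disagreements)


# Hypothetical records: observer A marks 10 blocks, observer B marks 12,
# and no block is marked by both (22 distinct occurrences in all).
observer_a = set(range(0, 10))
observer_b = set(range(100, 112))

print(round(count_ratio(observer_a, observer_b), 2))  # 0.83
print(event_agreement(observer_a, observer_b))        # 0.0
```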

Whenever using the appropriate method of calculating observer agreement
percent (i.e., number of agreements / (number of agreements + disagreements)),
the investigator
should be particularly cognizant of the base rate problem. That is, the
obtained percent agreement figure should be compared with the amount of
agreement that could be obtained by chance. An example will clarify this
point. Suppose two coders are coding on a binary behavior coding system
(e.g., appropriate vs. inappropriate behavior). For the sake of
illustration, let us suppose that observers have to characterize the subject's
behavior as either appropriate or inappropriate every five seconds. Now,
let us suppose, as is usually the case, that most of the subject's behavior
is appropriate. If the subject's behavior were appropriate 90% of the time,
two observers coding randomly at these base rates (i.e., .90-.10) will obtain
82% agreement by chance alone. Chance agreement is computed by squaring the
base rate of each code category and summing these values.³ In this simple
case, the mathematics would be as follows: $.90^2 + .10^2 = .82$. The same
procedure may, of course, be used with multi-code systems.
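
The chance agreement computation can be sketched in Python as follows. The function name and the three-code base rates are assumptions made for illustration; only the .90-.10 case comes from the text.

```python
def chance_agreement(base_rates):
    """Expected agreement for two observers coding independently at the
    same base rates: the sum of the squared base rates."""
    if abs(sum(base_rates) - 1.0) > 1e-9:
        raise ValueError("base rates must sum to 1")
    return sum(p * p for p in base_rates)


# Binary system from the text: 90% appropriate, 10% inappropriate.
binary = chance_agreement([0.90, 0.10])

# Hypothetical three-code system, to show the multi-code case.
three_codes = chance_agreement([0.70, 0.20, 0.10])

print(round(binary, 2))       # 0.82
print(round(three_codes, 2))  # 0.54
```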

The above .90-.10 split problem may be reconceptualized as one in which
the occurrence or nonoccurrence of inappropriate behavior is coded every five
seconds. If, for purposes of computing observer agreement, we look at only
those blocks in which at least one of two observers coded the occurrence
of inappropriate behavior, the chance level agreement is drastically reduced.
The probability that two observers would code occurrence in the same block by
chance is only $.10^2$ or one percent. It would not be theoretically
inappropriate to count agreement on nonoccurrence but, in the present example and
in most cases, this procedure is associated with relatively high levels of
chance agreement.
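
A minimal Python sketch, using an invented record of twenty five-second blocks, shows how the two scoring rules can diverge. The function names and the data are hypothetical.

```python
def occurrence_agreement(codes_a, codes_b):
    """Percent agreement over only those blocks in which at least one
    observer coded the behavior as occurring (True)."""
    if len(codes_a) != len(codes_b):
        raise ValueError("records must cover the same blocks")
    scored = [(a, b) for a, b in zip(codes_a, codes_b) if a or b]
    if not scored:
        raise ValueError("neither observer coded any occurrence")
    agreements = sum(1 for a, b in scored if a and b)
    return agreements / len(scored)


def overall_agreement(codes_a, codes_b):
    """Percent agreement over all blocks, counting joint nonoccurrence."""
    matches = sum(1 for a, b in zip(codes_a, codes_b) if a == b)
    return matches / len(codes_a)


# Hypothetical record of 20 blocks; True = inappropriate behavior coded.
observer_a = [False] * 16 + [True, True, False, False]
observer_b = [False] * 16 + [True, False, True, False]

print(overall_agreement(observer_a, observer_b))               # 0.9
print(round(occurrence_agreement(observer_a, observer_b), 2))  # 0.33
```

In this invented record the observers agree on 90% of all blocks, almost entirely through joint nonoccurrence, but on only one of the three blocks in which either of them coded an occurrence.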

Whenever percent agreement data is reported, the base rate chance
agreement should also be reported and the difference noted. Statistical tests of
that difference can, of course, be computed. As long as the base rate data
is reported, the percent agreement figure would always seem to be appropriate.
For obvious reasons, however, it becomes less satisfactory as the chance
agreement figure approaches 1.0.

The other common method of computing agreement data is by means of a
correlation between two sets of observations. The values may be scores from a group
of subjects or scores from n observation segments on one subject. This method
is particularly useful when one is faced with the high chance agreement problem
or where the requirement of simple similarity in ordering subjects on the
dependent variable is sufficient for the research. As we shall illustrate, the
correlation is also particularly useful in cases where one has a limited
sample of observer agreement data relative to the total amount of observation
data. In general, correlations have been used with data scores based on
relatively large time samples. In other words, they
tend to be used for summary scores on individuals over periods of 10 minutes
to 24 hours. There is no reason why correlation methodology could not be
applied to data from smaller time segments (e.g., 5 seconds), but this has
rarely been done. So, studies using correlation methods have generally been
those in which one cannot be sure that the same behaviors are being jointly
observed at the same time. In using correlation methods for estimating
agreement, one should be aware of two phenomena. First, it is possible to
obtain high coefficients of correlation when one observer consistently
overestimates behavioral rates relative to the other observer. This
difference can be rather large, but if it is consistently in one direction,
the correlation can be quite high. For some purposes this problem would be
of little consequence but for other purposes it could be of considerable
importance. The data can be examined visually, or in other more systematic
ways, to see to what extent this is the case. This problem can be virtually
eliminated if one uses many observers and arranges for all of them to
calibrate each other for agreement data. Under these circumstances, one will
obtain a collection of regular observer figures and a list of mixed
calibrator figures for correlation. This procedure should generally correct for
systematic individual differences and make a consistent pattern as outlined
above extremely unlikely. The second problem to be cognizant of in using
correlations is that higher values become more possible as the range on
the dependent variable becomes greater. This fact may lead to high indexes
of agreement when observers are really quite discrepant with respect to
the number of a given behavior they are observing. An illustration may
clarify this point. Let us suppose we are observing rates of crying and
whining behavior in preschool children over a five-hour period. Some
particularly "good" children may display these behaviors very little and,
given a true occurrence score of 7, two observers may obtain scores of 5
and 10 on this behavior class. This would be only 50% agreement. Other
children display these behaviors with moderate to very high frequency. For
a child with high frequency, we may find our two observers giving us scores
of 75 and 125 respectively. This would be equivalent to 60% agreement and,
of course, represents a raw discrepancy of 50 occurrences. Yet, if these
examples were repeated throughout the distribution of scores and if there
were little overlap, a high correlation would be obtained. This would be
even more true, of course, if one observer consistently overestimated the
rates observed by the other. Yet, even this possibility does not necessarily
jeopardize the utility of the method. It must merely be recognized, examined,
and its implication for the question under study evaluated. In our own
research we want to catalogue the deviancy rates of normal children, compare
them with deviant children, and observe changes in deviancy rates as a
result of behavior modification training with parents. For these purposes,
general agreement on levels of deviant responding is quite good enough.
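
The crying-and-whining illustration can be extended into a small Python sketch. Only the score pairs (5, 10) and (75, 125) come from the text; the three intermediate pairs are invented to fill out the distribution, and the Pearson coefficient is computed by hand so that the sketch needs nothing beyond the standard library.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Scores for five children. The first and last pairs are from the text;
# the middle three are hypothetical. Observer B consistently records more.
observer_a = [5, 20, 40, 60, 75]
observer_b = [10, 35, 65, 100, 125]

r = pearson_r(observer_a, observer_b)
ratios = [min(a, b) / max(a, b) for a, b in zip(observer_a, observer_b)]

print(f"correlation: {r:.3f}")
print("smaller/larger ratio per child:", [round(v, 2) for v in ratios])
```

Under these assumed scores the correlation is very close to 1.0 although no child's pair of scores reaches even 65% on the smaller/larger ratio, which is the pattern described above.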

In our research on the normal child, we have had 47 families of the
total 77 families observed for the regular five-day period by an assigned
observer. On one of these days an additional observer was sent to the family
for the purpose of checking observer agreement. The correlation between the
deviant behavior scores of the two observers was .80. But, in a purely
statistical sense, this figure is an underestimate of what the agreement
correlation would be for the full five days of observation. Since we are
using a statistic based on five times as much data, we want to know the expected
observer agreement correlation for this extended period. Adding time to an
observation period is analogous to adding items to a test. The problem we are
faced with here is very similar to that dealt with by traditional test theorists
who have sought, for example, to estimate the reliability of an entire
test based on the reliability of some portion of the test. In our case,
we want to know the expected correlation for the statistic based on five
days when we have the correlation based on one day. The well-known
Spearman-Brown formula (Guilford, 1954) may be applied to this end (as in
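Patterson, Cobb, & Ray, 1972; Patterson & Reid, 1970; Reid, 1967).

$$r_{nn} = \frac{n\,r_{11}}{1 + (n - 1)\,r_{11}}$$

where $r_{nn}$ = estimated reliability of the test of total length
      $r_{11}$ = reliability of the test of unit length
      $n$ = length of total test.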
With the Spearman-Brown correction, the expected observer agreement
correlation for the deviant behavior score is .95. This same procedure has also
been applied to other statistics of particular interest in this research
including: a) the proportion of the parent's generally "negative" responses
(corrected agreement = .97), b) the proportion of the parent's generally
positive responses (corrected agreement = .98), c) the median agreement
coefficient of the 29 behavior codes observed for five or more children
(corrected agreement = .91), d) the median corrected agreement of the 11
out of 15 deviant behavior codes used (r = .91), e) the number of parental
commands given (corrected agreement = .9[illegible]), and f) the compliance ratio
(i.e., compliances/compliances plus noncompliances) of the child (corrected
agreement = .92). As our research is completed, we will be presenting
observer agreement data using different statistics, computed in different
ways, and evaluated by different criteria.
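
The Spearman-Brown step can be checked with a few lines of Python. The function name is ours for illustration; the formula is the standard Spearman-Brown prophecy formula, applied to the one-day correlation of .80 and a fivefold lengthening.

```python
def spearman_brown(r_unit, n):
    """Expected correlation for a measure n times as long as the one on
    which r_unit was obtained (Spearman-Brown prophecy formula)."""
    return n * r_unit / (1 + (n - 1) * r_unit)


one_day = 0.80
five_days = spearman_brown(one_day, 5)

print(round(five_days, 2))  # 0.95
```

Setting n = 1 returns the original correlation unchanged, which is a quick sanity check on the formula.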
The primary point of this section is to indicate that there are many
ways of calculating observer agreement data and there is no one "right way
to do it." The methods differ on three basic dimensions: a) the nature
and breadth of the dependent variable unit, b) the time span covered, and
c) the method of computing the index. Each investigator must make his own
decisions on each of these three points in line with the purposes of his
investigation. But, the investigator should be guided by one central
prescription: the agreement data should be computed on the score used as the
dependent variable. It makes no sense to report overall average agreement
data (except perhaps as a bow to tradition) when the dependent variable
is "deviant behavior rate." In addition, it makes little sense to make
the agreement criteria relative to time span more stringent than necessary.
If the dependent variable is overall rate of deviant behavior for a
five-day period, then this is the statistic for which agreement should be
computed. It is not necessary for this limited purpose that both observers
see the same deviant behavior in the same brief time block.

Before closing this section on the computation of observer agreement,
we should address the somewhat unanswerable question of the minimum criteria
for the acceptability of observer agreement data. In other words, how much
agreement is sufficient for moving on to consider the results of a particular
study? When using observer agreement percent, it would seem reasonable, at
the very minimum, to show that the agreement percent is greater than that
which could be expected by chance alone. When dealing with correlation data,
one should at least show the obtained correlation to be statistically
significant. These criteria are, of course, extremely minimal and certainly far
below those criteria commonly used in traditional testing and measurement
to establish reliability (e.g., Guilford, 1954). Yet, these criteria
do provide a reasonable lowest level standard and there are some very good
reasons why we should not be overly conservative on this point. In the
first place, very complex codes, which may provide us with some of our
most interesting findings, are very difficult to use with complete accuracy.
On the basis of our experience, and that of G. R. Patterson (personal
communication), we see an overall agreement percent of 80% to 85% as traditionally
computed as a realistic upper limit for the kind of complex code we are
using.

Furthermore, to the extent that less than perfect agreement represents
only unsystematic error in the dependent variable, it cannot be considered
a confounding variable accounting for positive results. Any positive finding
which emerges in spite of a good deal of "noise" or error variance is probably
a relatively strong effect.

Low observer agreement does, however, have very important implications
for negative results. This gets us back to the fundamental principle that
one can never prove the null hypothesis. The more error in the measurement
instrument, the greater the chance for failing to discover important
phenomena. Thus, just as with traditional test reliability, the lower the
observer accuracy, the less confidence one can have in any negative findings
from the research.

Observer Agreement and Accuracy II: 
Generalizability of Observer Agreement Data 
All of the preceding discussion on the calculation of observer agree- 
ment data relies on the assumption that the obtained estimates of agree- 
ment are generalizable to the remainder of the observers' data collection. 
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In most naturalistic behavioral re.s^Mrch, however, this dr> .urTi[>t ir^n c<innot 
go unchallenged and thin bring:, ug to our next, and largo Iv luble. 
methodological problem. To illustrate this pz-oblem, let u:. Lake the not 
untypical case of an investigator who trains his observers on a behavioral 
code until they meet the criterion of two consecutive observation sessions 
at 80% agreement or better. After completing this training, tiic investi- 
gator embarks on his research with no further assessment of oh orver agree- 
ment. There are three basic problems with this methodology which make the 
generalizabiliti^ of this agreement data extremely questionable. These 
problems are a_) the nonrandomness of the selected data points, b) the 
unrepresentativeness of the selected data points in terms of the time of the 
assessment, and c) the potential for the c^bserver's reactivity to being 
checked or watched* The first two problems may be rather easily solved In 
all naturalistic research, but the third problem represents quite a challenge 
to some forms of naturalistic observation. Let us explore these problems in 
more detail. The nonrandomness of selecting the last two "successful"
observation sessions in a series for establishing a true estimate of 
agreement should be very obvious. It is not unlikely that, had the investi- 
gator obtained several additional agreement sessions, he would find the 
average agreement figure to be lower than 80%. It is quite possible that 
our observers had, by chance, two consecutive "good days" which are highly 
unrepresentative of the days to come. One can almost visualize our hypo- 
thetical investigator, after the first day of highly accurate observation, 
saying to his observers, "That was really a good one; all we need is one 
more good session and we can begin the study." But, now we are getting 
into problems two and three.
The second problem of unrepresentativeness in terms of time has previously
been discussed by Campbell and Stanley (1966) and labeled instrument
decay. That is, estimates of observer accuracy obtained one week may not be
representative of observer accuracy the next week. The longer the research
lasts, the greater is the potential problem of instrument decay. In the
case of human observers, the decay may result from processes of forgetting,
new learning, fatigue, etc. Thus, because of instrument decay, our
investigator's estimate of 80% agreement is probably an exaggeration of the
true agreement during the study itself. The problem of instrument decay is
also often compounded by the fact that during observer training, there is usually
a great deal of intense and concentrated work with the code, coupled with 
extensive training and feedback concerning observer accuracy. This inten- 
sity of experience and feedback is usually not maintained throughout the 
course of the research, and, as a result, the two time periods are charac- 
terized by very different sets of experiences for the observers. The third 
problem of generalizability of this agreement data involves the simple fact
that people often do a better, or at least a different, job when they are 
aware of being watched as opposed to when they are not. Campbell and 
Stanley (1966) have labeled this problem reactive effects of testing. It
is likely that, when observers are being "tested" for accuracy, they will 
have heightened motivation for accuracy and heightened vigilance for criti- 
cal behaviors or for the coding peculiarities of their calibrator. This 
point has been brought home dramatically to us on more than one occasion 
by the tears of an observer after earning a particularly low agreement 
rating. Thus, because of the reactivity problem, estimates of observer 
agreement obtained with the awareness of the observer are likely to
overestimate the true agreement level which would be obtained if the observer
were not aware of such calibration.

Fortunately, all of the preceding logical arguments have been investigated
in some recent research largely contributed by John Reid of the Oregon
Research Institute. In his first published study on this problem, Reid 
(1970) designed a study which, from the observer's point of view, was almost 
identical to the hypothetical example given earlier. In this study,
observers were trained for a maximum of eight days on complex behavioral coding
of videotape material in a laboratory setting. They were trained until they
obtained 70% agreement on two consecutive days for 15 minutes of observation
per day. In each training session, feedback was given to the observers
concerning their accuracy, but in the immediately following assessment
session no such feedback was given. After training was completed, the observers
were led to believe that they would be the only ones seeing the new tapes
because criterion protocols had not been previously constructed and 
the tapes would have to be erased immediately after coding so that they 
might be reused. Observers were urged, therefore, to be very careful in 
their subsequent coding. On each of four consecutive days after reaching 
criterion, the observers were asked to complete 30 minutes of observation 
on similar stimulus materials. The first 5 minutes of each session served
as a practice period in which the experimenter gave feedback to the
observer when necessary. Although the observers were not aware of it,
criterion protocols had been prepared on these videotape materials, thus
providing a covert assessment of observer accuracy. Throughout the study,
observer accuracy was computed by comparing the data entry with the
standard protocol. Both errors of commission and omission were counted. This
is the type of overall and generally stringent accuracy criterion which we
have earlier recommended against for most research purposes. In any event,
all observers reached the 70% agreement criterion at or before eight days
of training and the average agreement during the two criterion days was
75.50%. The average agreement during the following four days, however, was
only 51.25%, representing a drop of nearly 25 percentage points. Now it is
obvious that, in this first study, all three confounding problems were
present: a) unrepresentativeness of the criterion days, b) instrument decay
due to passage of time, etc., and c) reactivity to being tested. This is
the procedure, however, which is representative of many investigations in
the field. Reid's study certainly demonstrates the potentially powerful
impact of these confounding influences. Because of the relatively short
time span involved in this study, it is likely that most of this effect
is attributable to the reactivity problem.

A more recent study by Taplin and Reid (in preparation) sheds greater
light on the contribution of each of these confounding influences.
Although some of the details of this latter study differed from the earlier
investigation, the basic procedures and concepts involved were the same.
After training to criterion of 70% agreement, the observers were assigned
to one of three groups: a) No-check group: the procedures in this group
essentially replicated those of the former study such that the observers
were unaware of agreement assessment beyond the criterion period. b)
Spot-check group: in this group, observers were told that their accuracy
would be checked from time to time during the experimental phase, but that
they would be aware of when these checks would be made. This group controls
for the problem of unrepresentativeness of criterion day performance as
well as instrument decay, but does not control for reactivity to testing.
c) Random-check group: in this group, the observers were told that their
accuracy would be checked randomly in 20% of the sessions but that they
would not be aware of when these checks occurred. This procedure not only
controls for the first two problems, but also controls for the reactivity
problem by making it constant. In other words, it is likely that the
knowledge of a "potential" assessment of accuracy will increase the observer's
motivation and vigilance. Yet, this is not a confounding influence since
any random assessment of accuracy in the series will not be
unrepresentative of the other points in the series. Thus, whenever practically
possible, this strategy of random assessment without observer awareness or
assessment on every observation occasion is most consistent with rigorous
research requirements. The results of the Taplin and Reid (in preparation)
study are presented in Figure 1 and are of particular interest here for
the way in which they illustrate the operation of the three confounding
problems. The discrepancy between percent agreement on the criterion days
and on the days of data collection is obvious and significant for all groups.
Thus, the use of this kind of previously collected criterion data for
generalization to subsequent data collection appears highly inappropriate,
regardless of the method of data collection or the rationale provided
observers during the data collection period. In all groups, the confounds
of unrepresentativeness of criterion day data and instrument decay from
the criterion days are obvious. Now, the observer accuracy data produced
on the spot check days are not unrepresentative, in the sense that these
days were presumably not chosen initially as particularly "good" ones, as
were the criterion days. In addition, this procedure controls for the
problems of instrument decay due to passage of time since the checks occurred
randomly throughout the period of data collection. It is clear that
"spot check" accuracy assessment produced less overestimation of true
accuracy than did the criterion day estimates. The data in this "spot check"
group is very clear, however, in illustrating the problem of observer
reactivity. The agreement percents in the two spot check periods are clearly
overestimates of agreement percents obtained under covert assessment. Thus,
the "spot check" method represents a decided improvement over the criterion
score method, but it is an imperfect solution.

Insert Figure 1 about here

In some naturalistic settings, the "spot check" method may have to
remain the optimal solution for reasons of cost and the potential for the
reactive effects inherent in using multiple observers continuously (e.g.,
in observations in homes or other confined quarters). As previously stated,
the methods of total accuracy assessment (e.g., as used by Wahler and by
Browning & Stover, 1971) or random accuracy assessment without awareness
(as in Taplin & Reid, in preparation) are always preferable when possible.
These methods are, of course, particularly simple to apply with video or
audio tape materials or in natural settings where two or more observers
are, for whatever reason, employed simultaneously and continuously. In
classrooms, for example, it is often the case that two or more observers
record the behaviors of two or more children. Under these circumstances,
the investigator can arrange the observers' recording schedules so that their
observations of subjects overlap at random times. In this way, two observers
can record the behavior of the same subject at the same time without either
having knowledge of the ongoing calibration for agreement which is occurring
at that specific time. This procedure would replicate the "random check
group" of Taplin and Reid (in preparation) in a field setting. It would
probably be difficult, if not impossible, to keep the fact of random
calibration a secret from the observers for any extended period, but, as stated
earlier, this is no real problem, because the randomly collected data
without specific awareness is representative of accuracy at other times. The
Taplin and Reid (in preparation) data would suggest that the motivational
effects of informing observers of the random checks slightly increase the
level and stability of their accuracy scores. (Compare the three groups'
accuracy level and stability in the data collection period in Figure 1.)
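
The random-check arrangement discussed above can be illustrated with a
short sketch. This is not the procedure used by Taplin and Reid; the function
name pick_covert_checks, the count of 20 sessions, and the fixed seed are
assumptions made only for illustration. Only the 20% check rate is taken from
the text.

```python
import random


def pick_covert_checks(n_sessions, check_rate=0.20, seed=None):
    """Return the sorted session indices chosen for covert calibration.

    Hypothetical helper: sessions are numbered 0..n_sessions-1 and a fixed
    fraction of them is drawn at random without replacement, so that an
    observer cannot anticipate which sessions will be checked.
    """
    if n_sessions < 1:
        raise ValueError("n_sessions must be at least 1")
    if not 0 < check_rate <= 1:
        raise ValueError("check_rate must be in (0, 1]")
    n_checks = max(1, round(n_sessions * check_rate))
    rng = random.Random(seed)  # seed is only for reproducible illustration
    return sorted(rng.sample(range(n_sessions), n_checks))


# Example: 20 observation sessions, 20% of them checked (4 sessions).
checked = pick_covert_checks(20, check_rate=0.20, seed=1)
print(checked)
```

Because every session has the same chance of being drawn, agreement
computed on the checked sessions is a fair sample of the whole series, which
is the property the argument above depends on.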

In more recent research, Reid and his colleagues have directed their
efforts to finding ways of eliminating the instrument decay or "observer
drift" observed in all previous studies regardless of the method of
monitoring. In several long-term research projects, including our own (e.g.,
Johnson, Wahl, Martin & Johansson, 1972), the one directed by G. R.
Patterson (e.g., Patterson, Cobb, & Ray, 1972) and the one reported by
Browning and Stover (1971), continuous training, discussion of the coding
system, and accuracy feedback are provided for the observers. It is possible
that this kind of training and feedback could eliminate, or at least
attenuate, observers' accuracy drift as well as the problem of the
unrepresentativeness of "spot check" accuracy assessments. To test this
hypothesis, DeMaster and Reid (in preparation) designed a study in which three
levels of feedback and training during data collection were compared in a
sample of 28 observers. The observers were divided into 14 pairs and all
subsequent procedures were carried out in the context of these fixed pairs. The
three experimental groups were as follows: Group I--Total Feedback--In this
group observers a) discussed their observation performance together while reviewing
their coding of the previous day's video tape, b) discussed their previous
day's observation with the experimenter in terms of their agreement with the
criterion coded protocol, and c) received a daily report of their accuracy
with respect to the criterion protocol. Group II--Pair Agreement Feedback--
In this group, observers were given the opportunity to discuss their
performance as in a) above and b) were given a daily report on the extent to
which each observer's coding protocol agreed with the protocol of the other
observer. Subjects in this group were deprived of a discussion or report of
their level of agreement with the criterion protocols. Group III--No
Feedback--Subjects in this group were deprived of the kinds of feedback
given in the previous two conditions and were instructed not to discuss
their work among themselves to eliminate a possible "bias of the data."
This group was similar in concept to the random-check group in the Taplin
and Reid (in preparation) study in that they were told, as were all other
subjects, that their accuracy would be checked at random intervals in the
data collection period. The dependent variables were a) the agreement scores
between pairs of observers and b) the "accuracy" scores reflected by the
percent agreement with the criterion protocols. The results showed that the
intra-pair observer agreement scores were significantly higher than were
scores reflecting agreement with the criterion. These results tend to
corroborate the hypothesis forwarded by Baer, Wolf, and Risley (1968) and
Bijou, Peterson and Ault (1968) that high intra-pair agreement does not
necessarily reflect proper use of the coding system. We shall call this
problem "consensual observer drift." It is very important to note, however,
that the design of this study which placed observers in fixed and unchanging
pairs would tend to maximize this effect. In the field studies referred to
above, observers typically meet in larger groups for training and feedback
and observers rotate in calibrating each other's observations. Under these
circumstances, the effects of consensual drift would logically be expected
to be less potent. Indeed, further data from the DeMaster and Reid (in
preparation) study lends support to this argument. On those video-tape
materials where more than one pair of observers had coded the sequence,
the investigators compared the fixed pair agreement with the agreement
between observers in other pairs. In all cases, the fixed pair agreed more
with one another than they did with the observers in the other pairs. Thus,
this idiosyncratic drift of fixed pairs may be greater than drift
experienced under currently employed field research procedures. Yet, a recent
study by Romanczyk, Kent, Diament, and O'Leary (1971) showed that during
overt agreement assessment observers would change their coding behavior
to more closely approximate the differential coding styles of their
calibrators. Thus, it is possible for observers to produce one kind of
consensual drift with some calibrators and an opposite consensual drift with
others to yield artificially high observer agreement data.
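
The distinction between intra-pair agreement and accuracy against a
criterion protocol can be made concrete with a small sketch. The
interval-by-interval agreement formula, the code labels "A", "B", and "C", and
all of the protocols below are hypothetical and chosen only to illustrate the
point; they are not data from any of the studies cited.

```python
def percent_agreement(protocol_a, protocol_b):
    """Interval-by-interval percent agreement between two coded protocols."""
    if len(protocol_a) != len(protocol_b) or not protocol_a:
        raise ValueError("protocols must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(protocol_a, protocol_b))
    return 100.0 * matches / len(protocol_a)


# Hypothetical data: eight observation intervals, arbitrary code labels.
criterion = ["A", "B", "A", "C", "B", "A", "C", "B"]
observer_1 = ["A", "B", "A", "B", "B", "A", "B", "B"]  # codes every C as B
observer_2 = ["A", "B", "A", "B", "B", "A", "B", "B"]  # shares the same habit

pair_agreement = percent_agreement(observer_1, observer_2)
accuracy_1 = percent_agreement(observer_1, criterion)
accuracy_2 = percent_agreement(observer_2, criterion)

print(pair_agreement, accuracy_1, accuracy_2)  # 100.0 75.0 75.0
```

Two observers who have drifted in the same direction agree perfectly with
each other while both fall short of the criterion, which is the pattern
described above as consensual observer drift.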

The manipulations in the Romanczyk et al. (1971) study were quite
powerful, however, and one can question the generalizability of these
artificially induced conditions to real field studies. Nevertheless, this
study does demonstrate the potential for powerful and differential consensual
drift. In spite of those considerations, one must realize that it is
impossible in an ongoing field observation to have a "pure" criterion protocol,
since one cannot arbitrarily designate one observer's protocol as the "true"
criterion and the other as the imperfect approximate. But, one can
attenuate this problem considerably by having frequent training sessions with
observers on pre-coded video-tape material or on pre-coded behavioral
scripts which may be acted out live by paid subjects. The importance of
this recommendation is underlined by DeMaster and Reid's (in preparation)
second important finding. Analysis of the data indicated a significant main
effect for feedback conditions, with the total feedback group doing best,
followed by the intra-pair feedback group and the no feedback group,
respectively.

It may be of interest to review briefly how our own project stacks up
with regard to these considerations and to suggest ways in which it and
similar projects might be improved in this area. Initial observer training
in our laboratory consists of the following program: a) reading and study
of the observation manual, b) completion of programmed instruction materials
involving precoded interactions, c) participation in daily intensive training
sessions which include discussion of the system and coding of precoded
scripts which are acted out live by paid but nonprofessional actors, and d)
field training with a more experienced observer followed immediately by
agreement checks. Currently, when an observer obtains five sessions with
an average overall percent agreement of 70% or better, she may begin regular
observation without constant monitoring. All observers continue to
participate in continuous training and are subject to continuous checking with
feedback. This is accomplished in two ways. First, each observer is
subject to one spot-check calibration for each family she observes. This
calibration may come on any one of the regular five days of observation.
Both observers figure their percent agreement in the traditional way
immediately after the session and discuss their disagreements at this time.
If they cannot resolve their disagreement on a particular or idiosyncratic
problem, they immediately call the observer trainer, who serves as a sort of
imperfect criterion coder. From time to time, idiosyncratic problems arise
which cannot be resolved by the coding manual alone. Decisions on how to
code these special cases are made by the group and the trainer and are
entered in a "decision log" which is periodically studied by all observers.
These special circumstances are unfortunate and provide an opportunity for
consensual drift, but are part of the reality with which we must deal. The
"decision log" helps attenuate the drift problem on these decisions, and most
of them tend to be idiosyncratic to one or two families. The second aspect
of continual training involves a minimum of one 90-minute training session
per week for all observers involving discussion and live coding experience.
We have been negligent in our procedures in not retaining our precoded
scripts over time and recoding these from month to month and year to year.
On the basis of our review of Reid's excellent work, we have now begun to
correct this error by retaining these scripts and subjecting them to recoding
periodically to check the problem of "consensual observer drift." As will
be obvious, we use the imperfect method of "spot check" calibration for
observer agreement, but Reid's data is encouraging in that it indicates that
the kind of intensive and continual training outlined here may attenuate the
problems associated with this method. Furthermore, our observers are
convinced that calibration scores obtained on a single day of observation are
probably lower than would be obtained over two or more days of observation.
The reason for this belief is that the calibrator would logically have more
difficulty in adapting to each new home environment and identifying the
subjects of observation on the first day in the home than on subsequent
days. Unfortunately, we have no hard data to prove this hypothesis, but we
have begun to do more than one day of calibration on families in order to
test it.

The problem of consensual drift is also attenuated in this project by
the practice of having each observer calibrate all other observers. We
recently began to employ only one calibrator for reasons of convenience and
cost, but this review has persuaded us to return, at least partially, to 
multiple calibration among all observers. 

As stated earlier, the problems associated with reactivity to testing 
for observer agreement could largely be solved by procedures which involved 
coding of audio or video tapes. This is true because one could arrange 
calibration on a random basis without observer awareness. Because proce- 
dures of this kind could also solve or attenuate problems of observer bias 
and subject reactivity, we are beginning to consider procedures of this type 
more seriously for future research and are now involved in pilot work on 
the feasibility of these methods. Short of this, we must be content with the
"spot check" method as outlined and attempt to attenuate the problems
associated with this method by use of extensive training and feedback as
suggested by DeMaster and Reid (in preparation). 

Reliability of Naturalistic Behavioral Data 
One must look long and hard through the behavior modification literature 
to find even an example of reliability data on naturalistic behavior 
rate scores. In classical test theory, the concept of reliability involves 
the consistency with which a test measures a given attribute or yields a 
consistent score on a given dimension. Theoretically, a test of intelli- 
gence, for example, is reliable if it consistently yields highly similar 
scores for the same individual relative to other individuals in the sample. 
There are several approaches to measuring reliability including split-half 
measures, equivalent forms, test-retest methods, etc. Each method has a
somewhat different meaning, but the basic objective of each is an estimate
of the consistency of measurement. It is difficult to tell whether
behaviorists have simply neglected, or deliberately rejected, the reliability
requirement for their own research. The concept comes out of classical test
theory and is obviously allied to trait concepts of personality.
Behaviorists may feel that the concept is irrelevant to their purposes. After
all, we know that there is often very little proven consistency in human
behavior over time and stimulus situations (e.g., see Mischel, 1968), so why
should we require a consistency in our measurement instruments that is not
present in real life? Behaviorists may feel that reliability is an outmoded
concept and belongs exclusively to the era of trait psychology. If this is,
in fact, the reason for the neglect of the reliability issue in behavioral
research, it represents a serious conceptual error and a clear misapplication
of the meaning of the data on the lack of behavioral consistency so
eloquently summarized by Mischel (1968). It is true, of course, that
behaviorists employ more restricted definitions of the topography of the
relevant response dimensions (e.g., hitting vs. aggression) and that they often
include more restrictive stimulus events in defining these dimensions (e.g.,
child noncompliance to mother's commands vs. child negativism). Yet, the
fact remains that we are still dealing with scores that reflect behavioral
dimensions. If the word "trait" offends, then another label will do as
well. Furthermore, the scores are obtained for the same purposes that trait
scores are obtained: to correlate with some other variable. Generally,
behavior modifiers "correlate" these scores with the presence or absence
of some treatment procedure but certainly our data is not limited to this one
objective. In our own research, for example, we are currently comparing
children's deviant behavior rates in their homes with their deviancy in the
school classroom (Walker, Johnson, & Hops, 1972) and comparing the deviancy
rates of normal children with those observed in referred or "deviant"
children (Lobitz & Johnson). The most elementary knowledge of the
concept of reliability tells us that some minimal level of behavior score
reliability is necessary before we can ever hope to obtain any significant
relationship between our behavioral score and any external variable. Thus,
the requirement of score reliability is just as important in research
employing behavioral assessment as it is in more traditional forms of
psychological assessment, but with only a few exceptions (e.g., Cobb, 1969;
Harris, 1969; Olson, 1930-31; Patterson, Cobb, & Ray, 1972) behaviorists
have ignored this important issue.
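
The claim that some minimal score reliability is needed before any
relationship with an external variable can appear follows from the
attenuation relation of classical test theory: the observed correlation
equals the true correlation multiplied by the square root of the product of
the two reliabilities. The sketch below simply evaluates that relation. The
function name and the true correlation of .50 are assumptions chosen for
illustration and are not values from this research.

```python
import math


def attenuated_correlation(true_r, reliability_x, reliability_y=1.0):
    """Expected observed correlation under classical test theory.

    observed r = true r * sqrt(reliability_x * reliability_y)
    """
    for rel in (reliability_x, reliability_y):
        if not 0.0 <= rel <= 1.0:
            raise ValueError("reliabilities must lie in [0, 1]")
    return true_r * math.sqrt(reliability_x * reliability_y)


# Hypothetical true correlation of .50 with a perfectly measured criterion,
# observed through behavioral scores of decreasing reliability.
for rel in (0.90, 0.72, 0.49, 0.25):
    print(rel, round(attenuated_correlation(0.50, rel), 3))
```

As the reliability of the behavioral score falls, the correlation that can
be observed shrinks toward zero even though the underlying relationship is
unchanged.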

As a consequence of the reasoning presented above, we have been
particularly cognizant of the reliability of the scores used in our research.
We were quite encouraged to find, for example, that the odd-even split-half
reliability of our "total deviant behavior score" in a sample of 33 "normal"
children was .72. This reliability was computed by correlating the total
deviant behavior score obtained on the first, third, and first half of the
fifth day with the same score obtained from the remainder of the period.
After applying the Spearman-Brown correction formula, we found that the
reliability of this score for the entire five-day observation period was .83.
This relatively high level of reliability indicates that this score should,
at least in a statistical sense, be quite sensitive to manipulation or to
true relationships with other external variables (e.g., social class, or
educational level of the parents). Other behavioral scores which are
important to our research include: a) the proportion of generally negative
responses of the parents (corrected reliability = .90), b) the proportion
of generally positive responses of parents (corrected reliability = .87),
c) the median reliability of the 35 individual codes (r = .69), d) the
corrected median reliability of the deviant codes = .66, e) the number of
parental commands during the observation (corrected reliability = .85), and
f) the compliance ratio (i.e., compliances/(compliances + noncompliances)) of
the child (corrected reliability = .49). The reliability of the compliance
ratio is not as high as we might have wished, but it may still be high enough
to be sensitive to powerful manipulations. We have been less
fortunate in obtaining good reliability scores on some other statistics
important to our research efforts. For example, the compliance ratios to
specific agents (i.e., to mothers or fathers) have yielded rather low
reliabilities. The reasons for this are twofold: First, ratio scores are always
less reliable than are their component raw scores, because they combine the
error variance of both components. Second, and of more general importance,
these scores are based on relatively few occurrences. On the average, for
example, fathers give only 36 commands over the five-day period. These
occurrences must then be divided for the compliances and noncompliances and
further split in half for the odd-even reliability estimate. By the time
this erosion takes place, there are few data points on which to base
reliability estimates. This problem is even more profound when we use one day
of compliance ratio data to compute observer agreement on this statistic,
since, on the average, fathers give only 7.2 commands per day. Thus, when
we are dealing with behavioral events of fairly low base rate, observer
agreement correlations and reliability coefficients may often not be
"fairly" computed because there is simply not enough data. In classical
test theory terminology, there may often not be enough "items" on the
behavioral test to permit an accurate estimation of the reliability of the
score. What should we do with cases of this kind? A methodological purist
might argue that we should throw out this data and use only scores with
proven high reliability and observer agreement. We would argue that this
course would be a particularly unfortunate solution for several reasons.
First, low base rate behaviors are often those of special importance in
clinical work. Second, if low reliability reflects nothing more than
random, unsystematic error in the measurement instrument, it cannot
jeopardize or provide a confounding influence on positive results (i.e., it
cannot contribute to the commission of Type I errors). But, either low
reliability or low observer agreement does have profound implications for the
meaning of negative results (i.e., the commission of Type II errors).
Fortunately, the effects of many behavior modification procedures are so
dramatic that they will emerge significant in spite of relatively low
reliability or observer accuracy.
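
The odd-even split-half procedure and the Spearman-Brown correction
mentioned above can be sketched as follows. The function names and the
per-child half scores are hypothetical and serve only to show the
computation; only the half-test value of .72 is taken from the text.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length, at least 2 items")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_brown(r_half, length_factor=2.0):
    """Step a part-test correlation up to a test length_factor times as long."""
    return length_factor * r_half / (1.0 + (length_factor - 1.0) * r_half)


# Hypothetical deviant-behavior scores for six children, one score from each
# half of the observation period.
first_half = [4.0, 9.0, 2.0, 7.0, 5.0, 11.0]
second_half = [5.0, 7.0, 3.0, 9.0, 4.0, 10.0]
r_half = pearson_r(first_half, second_half)
print(round(r_half, 2), round(spearman_brown(r_half), 2))

# Stepping up the half-test value of .72 quoted in the text.
print(round(spearman_brown(0.72), 2))  # 0.84
```

Applied to a half-test correlation of exactly .72, the formula returns
about .84. Small differences from a published corrected value can arise
because the half-test correlation is itself a rounded figure.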

In one of the other few examples of reliability data in the behavior 
modification literature, Cobb (1969) found that the average odd-even re- 
liability of relevant behavioral codes used in the school setting was only 
.72. Yet, Cobb (1969) found that the rates of certain coded behaviors 
showed strong relationships to achievement in arithmetic. Thus, relatively 
low reliability or observer agreement jeopardizes very little the meaning 
of positive results, but leaves negative results with little meaning.
There is, however, one very critical qualifying point to this argument. It 
is that the error expressed in low reliability or observer accuracy must 
be random, unsystematic, and unbiased. With this consideration in mind, we 
now move to what are perhaps the most important methodological issues in 
naturalistic research: observer bias and observee reactivity to the
observation process.

The Problem of Observer Bias in Naturalistic Observation
Shortly after the turn of the century, 0. Pfungst bec<ime intrip,uod 
with a mysteriously clever horse named Hans. By tapping his foot, "Clever 
Hans*' was able to add, subtract, multiply and divide and to spell, read, 
and solve problems of musical harmony (Pfungst, 1911). Hans' ,wner, a 
Mr. von Osten, was a German mathematics teacher who, unlik*-^ the vaudeville 
trainers of show animals, did not profit from the horse's peculiai^ talents. 
He insisted that he did not cue the animal and, as proof, he permitted 
others to question Hans without his being present. Pfungst ron^-'^ined in- 
c3?edulous and began d ^ rogram of systematic study to unravel tno mystery 
of Hans' talents. 

Pfungst soon discovered that, if the horse could not see the questioner,
Hans could not even answer the simplest of questions. Neither would Hans
respond if the questioner himself did not know the answer. Pfungst next
observed that a forward inclination of the questioner's head was sufficient
to start the horse tapping, and raising the head was sufficient to terminate
the tapping. This was true even for very slight motions of the head, as
well as the lowering and raising of the eyebrows and the dilation and
contraction of the questioner's nostrils.

Pfungst reasoned and demonstrated that Hans' questioners, even the 
skeptical ones, expected the horse to give correct responses. Unwittingly, 
their expectations were reflected in their head movements and glances to
and from the horse's hooves. When the correct number of hoof taps was
reached, the questioners almost always looked up, thereby signaling Hans
to stop (Rosenthal, 1966).

Some fifty years later, Robert Rosenthal began to investigate the 
importance of the expectations of experimenters in psychological research. 
In his now classical article, Rosenthal (1963) presented evidence
suggesting that the experimenter's knowledge of the hypothesis could
serve as an unintended source of variance in experimental results. In a
prototypical study, Rosenthal and Fode (1963) had naive rats randomly
assigned to two groups of undergraduate experimenters in a maze-learning
task. One group of experimenters was told that they were working with
maze-bright animals and the other group was told that their rats were
maze-dull. The group of experimenters which was led to believe that their
rats were maze-bright reported faster learning times for their subjects
than the group which was told their animals were maze-dull. An extension
of this finding to the classroom was offered by Rosenthal and Jacobson
(1966). Teachers were led to believe that certain, randomly selected
students in their classrooms were "late bloomers" with unrealized academic
potential. Pre- and post-testing in the fall and spring suggested that
children in the experimental group (late bloomers) had a greater increase
in IQ than did the controls.

The purpose of this section is to examine the problem of
experimenter-observer bias with regard to naturalistic observational
procedures. The amount of literature which deals directly with observer
bias in naturalistic observation is sparse (Kass & O'Leary, 1970;
Skindrud, 1972; Kent, 1972). However, Rosenthal has written an extensive
review of experimenter bias in behavioral and social psychological
research (Rosenthal, 1966). In spite of failures to replicate many of
Rosenthal's findings (Barber & Silver, 1968; Clairborn, 1969) and
extensive criticisms of Rosenthal's methodology (Snow, 1969; Thorndike,
1969; Barber & Silver, 1968), the massive body of literature compiled and
summarized by Rosenthal (1966) remains the best available resource for
conceptualizing the phenomenon of observer bias and for isolating possible
sources of bias relevant to naturalistic observation. A brief review of
this literature follows with a focus on integrating implications from
this literature with naturalistic observational procedures. In addition,
we will give consideration to the few experiments which have directly
investigated observer bias in naturalistic observation and further
consider some proposals for experiments yet to be conducted. Finally,
suggestions for minimizing observer bias will be outlined and data on
this problem from our laboratory will be presented.

Conceptualization of Observer Bias 

Rosenthal (1966) has defined experimenter bias as "the extent to which
experimenter effect or error is asymmetrically distributed about the
'correct' or 'true' value." Observer errors or effects are generally
assumed to be randomly distributed around a "true" or "criterion" value. 
Observer bias, on the other hand, tends to be unidirectional and thereby 
confounding. 
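
The distinction between random error and bias can be sketched numerically. The following is a minimal illustration, not a procedure taken from the studies reviewed here; the function name `bias_summary` and all counts are hypothetical. Random error should leave the mean signed error near zero with overcounts and undercounts about equally frequent, whereas bias pushes both statistics in one direction.

```python
from statistics import mean


def bias_summary(recorded, criterion):
    """Summarize observer error relative to criterion ("true") values."""
    if len(recorded) != len(criterion) or not recorded:
        raise ValueError("need two equal-length, non-empty sequences")
    errors = [r - c for r, c in zip(recorded, criterion)]
    nonzero = [e for e in errors if e != 0]
    share_over = (
        sum(1 for e in nonzero if e > 0) / len(nonzero) if nonzero else None
    )
    return {
        "mean_signed_error": mean(errors),   # near 0 when error is random
        "mean_absolute_error": mean(abs(e) for e in errors),
        "share_overcounts": share_over,      # near .5 when error is random
    }


criterion = [10, 12, 8, 11]   # hypothetical criterion protocol counts
unbiased = [11, 11, 9, 10]    # errors of +1, -1, +1, -1
biased = [12, 13, 9, 13]      # errors of +2, +1, +1, +2

print(bias_summary(unbiased, criterion))
print(bias_summary(biased, criterion))
```

The two hypothetical observers are similarly inaccurate in absolute terms, but only the second one's errors are asymmetrically distributed about the criterion.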

Sources of Observer Bias 

An important distinction should be drawn between observer error and 
observer effect on subjects. Invalid results may be contributed solely by 
systematic or "biased" errors in recording by observers. Or, invalid find- 
ings may be realized as a result of the effect that the observer has on his 
subjects (Rosenthal, 1966). First we will consider recording error as a 
source of observer bias. 

Kennedy and Uphoff (1939) illustrate the problem of recording errors in
an experiment in extrasensory perception. The observers' task was simply to
record the investigator's guesses as to the kind of symbol being
"transmitted" by the observer. Since the investigator's guesses for the
observers had been programmed, it was possible to count the number of
recording errors. In all, 126 recording errors out of 11,125 guesses were
accumulated among 28 observers. The analysis of errors revealed that
believers in telepathy made 71.5 percent more errors increasing telepathy
scores than did nonbelievers. Disbelievers made 100 percent more errors
decreasing the telepathy scores than did their counterparts. Sheffield and
Kaufman (1952) found similar biases in recording errors among believers and
nonbelievers in psychokinesis on tallying the results of the fall of dice.
Computational errors in summing recorded rates have also been documented by
Rosenthal in an experiment on the perception of people (Rosenthal, Friedman,
Johnson, Fode, Schill, White, & Vikan-Kline, 1964).
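
What separates bias from random error in a study of this kind is the direction of each recording error, and that can be tallied mechanically. The sketch below assumes a hypothetical record layout of (target symbol, programmed guess, recorded guess); the function name, the category labels, and the sample trials are illustrative and are not taken from Kennedy and Uphoff.

```python
def classify_recording_errors(trials):
    """Tally recording errors by their effect on the telepathy score.

    Each trial is (target, programmed_guess, recorded_guess). A hit is
    scored whenever the recorded guess matches the target symbol.
    """
    counts = {"accurate": 0, "inflating": 0, "deflating": 0, "neutral": 0}
    for target, programmed, recorded in trials:
        if recorded == programmed:
            counts["accurate"] += 1
        elif recorded == target:
            counts["inflating"] += 1   # a miss was written down as a hit
        elif programmed == target:
            counts["deflating"] += 1   # a hit was written down as a miss
        else:
            counts["neutral"] += 1     # wrong symbol, score unchanged
    return counts


# Hypothetical trials; the symbols are placeholders.
trials = [
    ("star", "star", "star"),        # accurate record of a hit
    ("star", "circle", "star"),      # inflating error
    ("wave", "wave", "cross"),       # deflating error
    ("cross", "circle", "wave"),     # neutral error
    ("circle", "square", "square"),  # accurate record of a miss
]
tally = classify_recording_errors(trials)
print(tally)

# Overall error rate reported in the study: 126 errors in 11,125 guesses.
error_rate_percent = round(126 / 11125 * 100, 2)
print(error_rate_percent)  # prints 1.13
```

Comparing the inflating and deflating tallies between believers and disbelievers is the comparison on which the reported percentages rest; the overall error rate is small, so the bias lies in the asymmetry rather than in the volume of errors.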

It is doubtful that these recording and computational errors were
intentional. However, as Rosenthal (1966, pp. 31-32) notes, data fabrication
or intentional cheating is not absent in psychological research, especially
where undergraduate student experimenters are employed as data collectors.
Rosenthal points out that these students "have usually not identified to a
great extent with the scientific values of their instructors." Students
may fear that a poor grade will be the result of an accurately observed
and recorded event which is incompatible with the expected event. Of two
experiments by Rosenthal which were designed to examine intentional erring
by students in a laboratory course in animal learning, one revealed a clear
instance of data fabrication (Rosenthal & Lawson, 1964) and the other showed
no evidence of intentional erring but did show some deviations from the
prescribed procedure (Rosenthal & Fode, 1963). Another study employing
student experimenters by Azrin, Holz, Ulrich, and Goldiamond (1961)
replicated Verplanck's (1955) verbal conditioning experiment. However, an
informal post-experimental check revealed that data had been fabricated by
the student experimenters. Later, the authors employed advanced graduate
students as experimenters and found that Verplanck's results were not
replicated.

The implications for naturalistic observation are obvious. Observer
error, whether it be unintentional or intentional, incurred during recording
or during computation, must be guarded against by accuracy checks and by
carefully concealing the experimenter's hypotheses. Although observer
agreement checks do not rule out the possibility of bias among the
observers whose data are compared, they at least arouse suspicion where
agreement figures are low and disagreements are consistent. Ideally,
observers should not be made responsible for the tallying of their own data.
Computations should be made by a nonobserver who is removed from knowledge
of the observations. Observers should be selected on the basis of their
identification with scientific integrity, and admonitions against possible
biasing effects should be repeated during the course of the experiment.
Finally, observers should be encouraged to disclose to the experimenter
both the nature and sources of any information they receive that might be
relevant to the objectivity of their observations. A questionnaire, filled
out after observation sessions, can facilitate this disclosure.

The other source of observer bias, which Rosenthal discusses (Rosenthal, 
1966), is the effect of the observer's expectancy on the subject. If an 
observer has an hypothesis about a subject's behavior, he may be able to
communicate his expectations and thereby influence the behavior.

Expectancy effects have previously been alluded to in Rosenthal's
study with animal laboratory experimenters (Rosenthal & Fode, 1963) and
teachers in the classroom (Rosenthal & Jacobson, 1966). Rosenthal's first
major study in expectancy effects is instructive in its simplicity.
Rosenthal and Fode (1963) had 10 experimenters obtain ratings from 206
subjects on the photo person-perception task. All 10 experimenters received
identical instructions except that five experimenters were informed that
their subjects would probably average a +5 success rating on the ten neutral
photos while the other five experimenters were led to expect a -5 failure
average. The results revealed that the group given the +5 expectation
obtained an average of +.40 vs. the -5 expectation group which yielded a
-.08 score. These differences were highly significant and subsequent
replications have supported these findings (Fode, 1960; Fode, 1965).

The implications for naturalistic observational procedures of the
expectancy effect on the subject's behavior are most discomforting. If, as
in the Rosenthal laboratory studies, observers in the natural setting can
communicate their expectancies to their subjects such that the subject's
behavior falls in line with those expectations, a serious threat to internal
validity is posed. Assuming that humans are no less sensitive to subtle
cues than Mr. von Osten's Clever Hans, it seems reasonable to infer that
observer expectancy effects are operative in the natural setting. Consider
the not atypical case of an observer who records selected deviant behaviors
of a child in a classroom before, during, and after treatment. Seldom is
it not obvious to the observer when treatment begins and ends. Assuming that
an observer might infer the expectations of the experimenter in such a
setting, how might he communicate these expectations to his subjects? One
way of influencing the targeted child is by nonverbal expressive cues.
Expressions of amusement by the observer during baseline might inflate
deviant behaviors. During intervention, expressions of disapproval or
caution by the observer might reduce the subject's deviant rate. These
biasing effects may be systematic and confounding.

Although few studies have systematically assessed the effects of ob- 
server bias in the natural setting, many field investigators have taken note 
of the expectancy phenomenon, and have included procedures to minimize 
its effect. One such technique is to mask changes in experimental conditions 
(e.g., Thomas, Becker, & Armstrong, 1968). Another is to keep observers
unaware of assignment of subjects to various treatment or control conditions 
(e.g., O'Conner, 1969). The addition of new observers in the last phase 
of a study who are naive to previous manipulations is another approach (e.g., 
Bolstad & Johnson, 1972).

Three studies in the natural setting shed further light on expectancy 
effects with naturalistic observational procedures. Rapp (1966) had eight
pairs of untrained observers describe a child in a nursery school for a
period of one minute. One member of each observer pair was subtly informed
that the child under observation was feeling "under par" that day and the
other that the child was "above par." In fact, all eight children showed
no such behaviors. Seven of the eight pairs of observers evidenced
significant discrepancies between partners in their description of the
nursery children in the direction of their respective expectations. Both
recording errors and expectancy effects on the subjects' behavior may have
contributed to this demonstration of observer bias.

A second study by Azrin et al. (1961) employed untrained undergraduate 
observers who were asked to count opinion statements of adults when they 
spoke to them. The observations of those who had been exposed to an 
operant interpretation of the verbal conditioning phenomenon under study 
were the exact opposite of those given a psychodynamic interpretation. 
Again, both the expectancy effects of the observer on the subject and
recording errors may have accounted for the observer bias. Post-experimental
inquiries by an accomplice student revealed that recording errors were the
main factor. The accomplice learned that 12 of the 19 undergraduates
questioned intentionally fabricated their data to meet their expectations.

A third study by Scott, Burton and Yarrow (1967) allows a comparison 
between the simultaneous observations of hypothesis-informed (Scott
herself) and uninformed observers. The observers coded behavior into positive
and negative acts from an audio-tape recording of the targeted child and
his peers. The informed observer's data differed significantly from the
others' in the direction of the experimenters' hypothesis.

These three studies strongly suggest that data collected by relatively 
untrained observers are influenced by observer expectations. Do these 
findings generalize to the observations of professional observers who are 
highly trained in the use of sophisticated multivariate behavior codes? 
As indicated earlier, the amount of available research which directly
pertains to this question is limited and somewhat equivocal.

Kass and O'Leary (1970) conducted the first systematic attempt to
manipulate observer expectations in a simulated field-experimental situation.
Three groups of female undergraduates observed identical videotaped
recordings of two disruptive children in a simulated classroom. The observers
were trained in nine category codes of disruptive behavior. Group I was
then given the expectation that soft reprimands from the teacher would
increase the rate of disruptive behavior. Group II was told that soft
reprimands would decrease disruptive behavior. And, Group III was given no
expectation at all about the effects of soft reprimands. Rationales were
given each group explaining the reasons for each specific expectation. The
effects of these expectations were assessed by having the observers watch
four days of baseline and five days of treatment data. The interaction
between the mean rate of disruptive behavior in the three conditions and
the two treatment conditions was significant at the .005 level, indicating
the presence of observer bias. Ronald Kent (1972) has suggested that these
reported effects of expectation bias were confounded with observer drift in
the accuracy of recording. When different groups of raters, who are
interreliable within groups, fail to frequently compute agreement between
groups, they may "drift" apart in their application of the behavioral code.
However, it should be noted that when this drift, comprised of recording
errors, is aligned asymmetrically in the direction of the expectation, then
the drift is, by definition, observer bias.
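
Drift of this kind shows up as high agreement within each observer group alongside lower agreement between groups. The sketch below is one way such a check might be computed; it assumes a simple interval-by-interval percent agreement index, which is an assumption of this example rather than the index used by Kass and O'Leary or by Kent, and the code letters and data are hypothetical. Each group must contain at least two observers.

```python
from itertools import combinations, product
from statistics import mean


def percent_agreement(codes_a, codes_b):
    """Share of observation intervals on which two observers' codes match."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("need two equal-length, non-empty code sequences")
    return sum(a == b for a, b in zip(codes_a, codes_b)) / len(codes_a)


def within_and_between(group_1, group_2):
    """Mean pairwise agreement within the groups and across the two groups."""
    within = [
        percent_agreement(a, b)
        for group in (group_1, group_2)
        for a, b in combinations(group, 2)
    ]
    between = [percent_agreement(a, b) for a, b in product(group_1, group_2)]
    return mean(within), mean(between)


# Hypothetical codes for five intervals: D = disruptive, N = not disruptive.
group_1 = [list("DDNND"), list("DDNND")]
group_2 = [list("DNNNN"), list("DNNNN")]

within, between = within_and_between(group_1, group_2)
print(within, between)
```

In this contrived case the observers agree perfectly with their own group and on only three of five intervals with the other group, the pattern that within-group reliability checks alone would never reveal.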

Skindrud (1972) attempted to replicate the findings of Kass and O'Leary
(1970). Observers were divided into three groups, each group given a different
expectation about video-taped family interactions. The first group was
given the expectation that when the father was absent there would be more
child deviant behaviors than when the father was present. A second group
was given the opposite expectation. Appropriate rationales were provided
for each of these two groups. An additional control group was added with
no expectations provided regarding father-present or father-absent tapes.
All observers were checked at the end of training on the rates of deviant
behaviors they recorded and subsequently matched on this variable when
assigned to conditions. Throughout the study, observer agreement data was
collected randomly. During training, reliability was checked daily, and the
average observer agreement prior to the beginning of the manipulation was 6^1%.
The results of the study gave no evidence for observer bias. There were no
significant differences between groups and no significant interaction effects.
There was little drift in the accuracy with which the code was used.
Sequential reliabilities were computed for the increase, decrease, and control
groups with average observer accuracy of 08.5%, 57.6%, and 58.4%,
respectively. These accuracy figures were computed by comparisons with
previously coded criterion protocols. The relatively small and consistent
decline in accuracy is consistent with the failure to find bias.

A similar unsuccessful attempt to replicate Kass and O'Leary (1970)
was reported by Kent (1972). Kent found that knowledge of predicted results
was not sufficient to produce an observer bias effect. However, when
the experimenter reacted positively to data which was consistent with the
given predictions and negatively to inconsistent data, a significant
observer bias effect was obtained.

The available literature dealing with observer bias in naturalistic
observation is both sparse and contradictory. Furthermore, the few studies
available have focused exclusively on only one source of observer bias,
namely, recording errors or errors of apprehension. Thus far, no one has
systematically investigated the effects of the observer's expectancies on
the subjects' behavior. In the three studies reported above, all
observations were made from video-taped recordings. There were no
opportunities for the observers to communicate their expectancies to their
subjects. Yet, in most studies employing naturalistic observational
procedures, observers do have that opportunity.

An important study which needs to be conducted is one which examines
the observer's expectancy effects on the subject. First, it would be
interesting to determine if observers could nonverbally communicate their
expectancies to subjects such that the subject's behavior changes in the
direction of the expectancy. The next step, of course, would be to
replicate this same design without specifically asking observers to attempt
to influence subjects, but merely to give them an expectation.

Perhaps the most important test of observer bias effects will be the 
one which combines recording errors and effects of observer expectancy on 
subjects in the naturalistic setting. One can question the generalizability
of highly controlled laboratory studies to live observations and to research
projects in which the observers are more invested in the outcome of the
research. The generalizability of studies which employ only taped versions
of a subject's behavior is further limited by excluding the possible effects
of an observer's expectancy on his subject's behavior.

Another variable which seems crucial to observer bias in the
naturalistic setting is the observer's responsiveness to admonitions to remain
scientific, objective, and impartial in the collection of data. Rosenthal
(1966) stresses the importance of the experimenter-observer's identification
with science and objectivity. He cites evidence suggesting that graduate
students obtain less biased data than undergraduates and interprets this 
difference as a function of identification with science. Perhaps observers 
who are repeatedly reminded to be impartial might be less susceptible to the 
influence of biasing information than observers not given these admonitions. 

A dimension which seems important in considering observer bias is the 
specificity of the code. In most of the Rosenthal literature, the dependent 
variable is scaled between such global poles as success and failure.
Intuitively, it seems logical that the more ambiguous the dependent measure,
the greater the possibility for bias. A multivariate coding system with
well-defined behavioral codes might be expected to restrict interpretive
bias. This is an empirical question worthy of examination.

Another variable which might greatly affect observer bias is observer 
agreement. The greater the observer agreement, the less likely is observer 
bias, even among observers with the same expectancy. 

Until more information is available on observer bias effects in
naturalistic observation, it seems very critical to do everything possible to
minimize the potential for these effects. Whenever possible, observers should
not have access to information that may give rise to confounding
consequences, and they should be encouraged to reveal the nature and source
of any information they do receive. In our research, we are currently
observing both families in clinical treatment and "normal" or nontreated
families. Knowledge of a family's status might seriously affect the
observer's data. Also, knowledge about treatment stages (baseline,
mid-treatment, post-treatment, and follow-up) might affect the observers'
data. After each observation, it is our policy to have observers fill out a
questionnaire concerning the nature and source of any biasing information.
Thus far, of 75 observations of referred families, observers have considered
themselves informed only 36% of the time. And, in all of these cases, their
information was correct. This information usually comes from a member of the
family being observed (56%). Other sources of information include
information leaks from the therapists (1 i%), the Child Study Center Clinic
generally (16%), and other sources (16%). Of the observers considering
themselves informed as to the clinic vs. "normal" status of the families,
29% also considered themselves informed as to treatment stage, but only
two-thirds of these observers were correct in their discrimination. In only
20% of the cases did the observer actually know the status of the case
(i.e., clinic vs. normal) and the treatment stage (baseline vs. after
baseline). Of the observers considering themselves completely uninformed of
the families' status, their guessing rate (clinical or "normal") barely
exceeded chance at 51%. Their guesses as to the four stages of treatment
were 36% correct and 80% correct on the discrimination between baseline and
after baseline.

Of the "normal" families seen, observers have considered themselves
informed as to family status only 17% of the time. However, in only 45% of
these cases were the observers actually correct in making the discrimination.
In the uninformed observations, however, observers were able to guess the 
family's status correctly 75% of the time. 
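
Whether a guessing rate meaningfully exceeds chance depends on the number of guesses, which is not reported above. The sketch below shows how such a comparison might be made with an exact binomial tail probability. The counts used (18 correct out of 50) are hypothetical, chosen only to match a 36% hit rate, and the chance rate of .25 assumes the four treatment stages are equally likely a priori, which is also an assumption of this example.

```python
from math import comb


def binomial_upper_tail(successes, trials, chance):
    """P(X >= successes) for X ~ Binomial(trials, chance)."""
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Hypothetical counts: 18 correct stage guesses in 50 uninformed
# observations (36%), against a chance rate of 1 in 4 stages.
p_four_stage = binomial_upper_tail(18, 50, 0.25)
print(p_four_stage)
```

A small tail probability would suggest that observers who call themselves uninformed are nonetheless picking up cues about treatment stage; a large one would be consistent with guessing.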

Not only are these questionnaires beneficial in gauging the amount of
potentially biasing information that observers discover, but they are
helpful in two other ways as well. First, because they reveal sources of
information leakage, steps can be taken to eliminate these sources. Second,
questionnaires, given after each family is observed, serve as a regular
reminder of the importance of unbiased, objective recording of behavior.

It is difficult to make any firm conclusions about the presence or 
absence of observer bias in naturalistic observation. Clearly, more research 
is needed on this question. However, it should also be clear that the poten- 
tially confounding influence of observer bias cannot be ignored and that 
steps can and should be taken to minimize its possible effect. 

The Issue of Reactivity in Naturalistic Observation 

In the previous section, we have considered the effects of an observer's 
bias in naturalistic observation. In this section, we will discuss the
effect of the observer's presence on the subjects being observed. Whereas
observer bias can potentially invalidate comparisons by confounding
influences, the reactive effects to being observed primarily constitute a
threat to the generalizability of the findings. That is, subjects' observed
behavior in the natural setting may not generalize to their unobserved
behavior. Webb, Campbell, Schwartz, and Sechrest (1966) have defined
reactivity in terms of measurement procedures which influence and thereby
change the behavior of the subject. Weick (1968) has also referred to
reactivity as "interference" or the intrusiveness of the observer himself
upon the behavior being observed. Clearly, situations which are highly
reactive in terms of "observer effects" are not likely to be generalizable
to situations in which such effects are absent.

Reactive effects have been studied in two basic paradigms: a) by
the study of behavioral stability over time, and b) by comparison of the
effects of various levels of obtrusiveness in the observation procedure.
In employing the first method, investigators have typically examined
behavioral data for change over time in the median level and variance of the
dependent variable. In general, it has been assumed that change reflects
initial reactivity and progressive adaptation to being observed. This
interpretation is particularly persuasive if there is an obvious stability in
the data after some initial period of change or high variability. While
this is a viable way of checking for reactivity effects, it is a highly
indirect method and relies on assumptions concerning the causes of observed
change. It is obvious that other processes could account for such change.
Furthermore, the lack of change certainly does not indicate a lack of reactive
effects. The second method, comparing obtrusive levels of observation,
appears less inferential than the first method. The problem with this method
is that it only provides a picture of relative degrees of reactivity between
obtrusiveness levels; it does not provide a measure of the degree of
reactivity relative to the true, unobserved behavior. However, this problem
can be remedied if one of the observational treatments in the comparison is
totally unobtrusive or concealed.
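
The first paradigm, examining level and variance over time, can be sketched as a simple early-versus-late comparison. This is only an illustration under assumed conditions: the function name, the split point, and the session counts are hypothetical, and no particular study's analysis is being reproduced.

```python
from statistics import median, pvariance


def early_late_summary(session_rates, n_early):
    """Compare level and variability of a behavior rate across sessions."""
    if not 0 < n_early < len(session_rates):
        raise ValueError("n_early must split the sessions into two parts")
    early, late = session_rates[:n_early], session_rates[n_early:]
    return {
        "early_median": median(early),
        "late_median": median(late),
        "early_variance": pvariance(early),
        "late_variance": pvariance(late),
    }


# Hypothetical counts of a coded behavior in eight successive sessions.
rates = [2, 9, 4, 11, 6, 7, 6, 7]
summary = early_late_summary(rates, n_early=4)
print(summary)
```

Here the median level is the same in both halves while the variance drops sharply, a pattern that would be read as adaptation under this paradigm. As noted above, such a reading is inferential: other processes could produce the same pattern, and a flat record would not rule out reactivity.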

To what extent does reactivity occur in naturalistic observation? The
literature addressing this question is commonly reported in reviews to be
contradictory (Wiggins, 1970; Weick, 1968; Patterson & Harris, 1968).
Several studies have been cited as providing evidence for the position that
reactive effects may be quite minimal. Others have been cited which suggest
that reactive effects are quite pronounced. The purpose of this review is
to: a) reconsider the contradictions in the literature on reactivity, b)
tease out those factors which seem to account for reactivity, and c)
propose further investigations which isolate these factors.

In a number of reviews on reactivity, several studies have been con- 
sistently cited which support the position that reactivity does not consti- 
tute a major threat to generalizability . One study frequently cited is the 
timely investigation of a Midwest community by Barker and Wright (1955). 
In this admirable study, careful naturalistic observations were made of 
children under ten years of age and their daily interactions with peers 
and parents. The authors assumed that reactive effects were short lived 
and that the adults and other members of the families quickly habituated 
to the presence of the observers. In addition, it was reported that, 
with the younger subjects in the sample, reactive effects were slight. 
However, these findings should be interpreted with much caution. What
is easily lost sight of in the summaries of this work is that the observers
in this study were free to interact with the subjects in a friendly but
nondirective manner. In fact, the basis for the authors' conclusion
that reactive effects were not pronounced was the
finding that "only" 20% of the children's behavioral interactions were with
the observer. Allowing the observer to interact with the subject must
certainly have increased the intrusiveness of the observer and provided the
opportunity for the observer to influence the subject's behavior. The
authors' other conclusion that reactivity, as measured by frequency of
interactions, positively correlated with age is also suspect in that children
below the age of five were not always informed that they were being observed,
whereas children above this age were.

Another study commonly cited in support of the minimal reactivity
position is that of Bales (1950). In this controlled laboratory investigation
the behavior of a discussion group was not found to be changed by three
levels of observer conspicuousness. This finding, however, may be limited to
the laboratory setting.

Two additional studies, frequently mentioned as supportive of the
minimal reactivity argument, made use of radio transmitter recording in the
naturalistic environment. Soskin and John (1963) had a married couple wear
a transmitter the entire time they were on a two-week vacation. Purcell
and Brady (1965) outfitted adolescents in a treatment center with a similar
recording device for one hour a day. When the protocols in both studies were
examined for the frequency of comments about being observed or listened to,
it was found that such references declined to a zero level either during the
first or second day of recording. This is not to say, of course, that these
subjects were not still aware of, and affected by, the recording device; the
results only indicate that the subjects talked about the device less after
the first day.

A recent investigation by Martin, Gelfand, and Hartmann (1971) can
also be interpreted as providing evidence for low levels of reactivity to
observation. This study involved 100 elementary school children, ages . to 7.
Equal numbers of male and female subjects were assigned to five observation
conditions following exposure to an aggressive model: a) observer absent,
b) female adult observer present, c) male adult observer present, d)
female peer observer present, and e) male peer observer present. During
the free-play session, the subjects' aggressive behavior was recorded by
observers behind a one-way mirror. No significant differences in aggressive
behaviors were obtained between the observer-present and observer-absent
conditions. The absence of differences between these two levels of
obtrusiveness in observation suggests little or no reactivity to the presence
of an observer. Within the observer-present condition, however, it was found
that peer observers significantly facilitated imitative aggressive responding
in both boys and girls compared to adult observers. Also, there was more
imitative aggression when the observer was the same sex as the subject. The
girls, but not the boys, showed significant increases in aggressive output
over time when the observer was present but not when the observer was absent.
This latter finding suggests that girls manifest initial reactivity to
the presence of an observer but later habituate to the observer's presence.
It is interesting that both paradigms for measuring reactivity were used in
this investigation and that each method supports different conclusions about
the degree of reactivity. In considering the generalizability of these
findings to naturalistic observation procedures, it should be noted that
observers in this study were instructed to not pay attention to the subjects
and were either seated facing away from the subjects (adult observers) or
given a coloring task to complete (peer observers). With naturalistic
observation procedures, on the other hand, observers typically pay very close
attention to their subjects.
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data of tion.il>2 r.f -iicw . to ;.i;;iav oj^ecifi^^ cir. un- 

stancc:.. (e.^;., Baior. , l-t^O; Mirtin i]., i^.'/i) 

Many other studies have been cited as demonstrating considerable reactive
effects of observation in naturalistic settings. One such study is that of
Polansky, Freeman, Horowitz, Irwin, Papania, Rapaport, and Whaley (1949).
These investigators observed delinquent children in a study of group
emotional contagion phenomena. The children were informed that the observers
were studying their reactions to various aspects of the summer-camp program.
The authors report that during the first week of observations, the children
essentially ignored the [...] of the observers but, during the second week,
many "blow-ups" [...] directed against the [...] especially the older
children. [...] postulate that the aggressiveness of the children can be
[...] as resistance to being observed. They also concede [...] resistance
hypothesis was confounded by "the second week [...]" described as an
increasing anti-adult aggressiveness [...] evolves after the children have
adjusted to the camp. [...] to what to conclude from this study about
reactivity. Was reactivity most prevalent when children were [...] the
observers in the second week? Or, was reactivity most [...] the first week
when the delinquent children were "suppressing" aggression prior to
habituating to the unfamiliar environment? And, more importantly, how much of
this pattern of cooperation in the first week and anti-adult aggressiveness
in the second would have occurred in the absence of observers? This question
is left
unanswered by the Polansky et al. (1949) study.

A more sophisticated study illustrating considerable reactive effects
is that of Roberts and Renzaglia (1965). In this study, eight therapists
saw two "clients" (students solicited from an introductory psychology course)
for three sessions. The three conditions were: a) with a tape recorder in
the room, b) microphone only, c) with a concealed microphone and a concealed
recorder. While no differences were found as to the number of [...] that
subjects talked in each condition, it was found that the clients made more
favorable self-references when they knew they were being recorded (conditions
a and b), and more unfavorable self-references when they did not realize
they were being recorded (condition c). While the generalizability of this
study is limited by the small n and structured situation employed, it does
provide evidence for reactivity effects.

The Bechtel (1967) study has also been presented in reviews as a study
demonstrating considerable reactive effects. Bechtel studied the movements
of people in an art museum with an automated recording device called a
hodometer. In one condition, subjects were requested to go through the art
exhibit in one room and rank the prints according to preference. Another
group was given the same instructions and also informed that they were being
observed. For the first group, subjects were found to leave the room with
an average latency of 313 seconds, covering an average area of 76 square feet.
The second group left after 220 seconds and covered only [...]8 square feet.
Bechtel concludes that being observed constituted an aversive situation which
led to escaping the room faster. No [...] ratios were provided to determine
whether or not the reported averages were significantly different, nor was
the number of subjects in each condition reported. Unfortunately, this kind
of study does not provide any information about habituation effects, since
subjects were allowed to leave the setting.

[...]ment. This article is the only study available which was designed
specifically to manipulate and measure observer effects in the homes of the
families observed. In this study, data obtained from mothers on their own
families were compared with the data on the same family collected by an
outside observer. There were three conditions, with five families per
condition: a) mothers collected the first five ten-minute sessions of
observational data and an outside observer collected the second five sessions
of data on the child and father only (M-O), b) the observer collected all
ten sessions as a test for habituation effects (O-O), and c) the mothers
collected all ten sessions as a control for habituation effects (M-M). The
dependent variables were the rates of total behaviors and the rate of deviant
behaviors. A problem in the research design of this study should be noted.
The mother was present in the family as a participant in the second condition
(O-O) and the second half of condition a (M-O), but was not a participant
when she was an observer in condition c and the first half of condition a.
These comparisons are confounded by mother presence and absence. In spite
of this confound, which would probably bias in favor of showing group
differences, no main effects for groups were found in analysis of variance for
either the rate of total interactions or deviant behaviors. Thus, on the
initially selected dependent variables, no reactive effects were apparent.

Patterson and Harris also divided their groups into high and low rate
interactors on the basis of the first five sessions. On the frequency of
total interactions measure, high rate interactors in the first five sessions
showed significant reductions in rate during the last five sessions. The
authors describe this decline as a "structuring effect" in that the subjects
appeared to program some activity together in the first five sessions.
Conversely, the low rate interactors in the first five sessions showed slight
increases in rates during the last five sessions. The authors describe this
transition as an habituation effect in that subjects initially involved
themselves in solitary activities or attempted to escape the observational
situation but later adjusted to it and interacted more. In general, there
were no changes in deviant behavior from the first set of five observations
to the last set of five. The only significant finding was that subjects who
displayed low rates of deviant behavior in the first five sessions (under
the M-O condition) increased their rate in the last five sessions. However,
it is possible that the mothers were recording fewer deviant behaviors and
more positive behaviors in the first five sessions than were the observers
in the second five sessions, thus contributing differentially to main trials
effects. An observational study by Rosenthal (1966) supports such a thesis.
He found that parents tended to code more positive changes in their children
than were actually present. And, Peine (1970) found that parents were less
observant of their children's deviant behaviors than were nonparent observers.

Patterson and Harris conclude that "generalization about 'observer
effects' should probably be limited to special classes of behavior" (p. 16).
A more recent study by Patterson and Cobb (1971) analyzed the stability of
each of the 29 behavior codes used in their coding system. If it is assumed
that individuals adapt to the presence of an observer over time, then a
repeated measures analysis of variance should reveal differences in the mean
level of various behaviors. Patterson and Cobb analyzed data for 31 children
from problem and nonproblem families over seven baseline sessions. None
of the changes in mean level for the codes produced a significant effect
over time. The investigators conclude that the observation data were
fairly stable for most code categories. It is possible, of course, that had
observations continued over a longer period of time, significant changes in
mean level for some behaviors would have been discovered. Given that
families were rarely observed on consecutive days by the same observer, it is
possible that different observers could have resensitized the families each
day, thereby extending the period required for adaptation.
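
The stability check described above can be illustrated with a minimal sketch
of a one-way repeated measures analysis of variance. The function name, data
layout, and rates below are hypothetical and are not taken from Patterson and
Cobb (1971); the sketch only shows how a sessions effect might be tested for a
single behavior code, assuming one rate per child per session and no missing
data.

```python
def repeated_measures_f(rates):
    """Return (F, df_sessions, df_error) for a children-by-sessions table.

    rates: list of rows, one row per child, one column per session
    (hypothetical layout; every child must have every session).
    """
    n = len(rates)            # number of children
    k = len(rates[0])         # number of sessions
    grand_mean = sum(sum(row) for row in rates) / (n * k)

    session_means = [sum(row[j] for row in rates) / n for j in range(k)]
    child_means = [sum(row) / k for row in rates]

    ss_total = sum((x - grand_mean) ** 2 for row in rates for x in row)
    ss_sessions = n * sum((m - grand_mean) ** 2 for m in session_means)
    ss_children = k * sum((m - grand_mean) ** 2 for m in child_means)
    # What remains after removing session and child effects is error.
    ss_error = ss_total - ss_sessions - ss_children

    df_sessions = k - 1
    df_error = (n - 1) * (k - 1)
    f_ratio = (ss_sessions / df_sessions) / (ss_error / df_error)
    return f_ratio, df_sessions, df_error


# Hypothetical rates of one behavior code: three children, three sessions.
example_rates = [
    [1.0, 2.0, 3.0],
    [2.0, 3.0, 5.0],
    [3.0, 5.0, 6.0],
]
f_ratio, df_sessions, df_error = repeated_measures_f(example_rates)
print(f_ratio, df_sessions, df_error)
```

A large F relative to its critical value for the two degrees of freedom would
indicate that mean level changed across sessions, as an adaptation hypothesis
would predict. A small F is consistent with stability but cannot by itself
rule out adaptation over a longer period.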

In summary, there are a few well-designed studies which have discovered
reactive effects (e.g., Roberts and Renzaglia, 1965; Bechtel, 1967; White,
1972), but there are several others where the meaning of the results is
unclear. There can be little doubt that the entire question has been
inadequately researched. Any general conclusions about the extent of
reactivity in naturalistic observation would seem premature at this time.

As White (1972) points out, the finding of reactive effects seems to
depend on many factors, including the setting (e.g., home, school,
laboratory), the length of observation, and the constraints placed on subjects
by the conditions of observation (e.g., no television during observations,
remain within two adjacent rooms, etc.). Furthermore, it should be realized
that reactivity may or may not be discovered depending upon what paradigm of
measurement is used (e.g., Patterson & Harris, 1968; Martin et al., 1971)
and what variables are analyzed as dependent variables (e.g., Roberts &
Renzaglia, 1965; White, 1972). Unless these factors are controlled for in
comparing experiments on reactivity, both contradictions and consistencies
as to the relative presence or absence of reactivity may falsely appear.

Assuming that reactivity to being observed in naturalistic settings
does occur, even if only to some minimal degree, the critical task is to
localize the sources of interference so that they can be dealt with more
directly. Four such sources will be discussed and experiments will be
proposed to measure the extent of their intrusiveness.

Factor 1: Conspicuousness of the Observer

The literature points to the level of conspicuousness or intrusiveness
of the observer as an important factor contributing to reactivity.
Presumably, the more novel and conspicuous the agent of observation, the more
distracting are the effects upon the individuals being observed. It would
also follow that longer habituation periods would be required for more
distracting observational agents in order to achieve stability of data.

Bernal, Gibson, Williams, and Pesses (1971) compared two observation
procedures which would presumably vary on obtrusiveness. These investigators
compared data collected by an observer with that collected by means of an
audio tape recorder which was switched on by an automatic timing device. The
family members involved in this study were aware of the presence of the
recorder but were unaware of the exact time of its operation. The primary
purpose of this study was to explore the feasibility of the audio tape
method and to explore the relationship of data collected by the two methods
rather than to study reactivity per se. The results indicated that, during
the same time interval, there was a high relationship between the mother's
command rate as coded by the observer and from the tape (r = .86) but that
the observer coded more commands. Similar results were obtained when the
observer's data were compared with data based on coding of the audio tapes
from different time intervals. The question arises as to how much of this
latter discrepancy was due to differences in levels of reactivity and how
much was due to differences associated with the source of coding. The
authors point out, for example, that the observer could code gestural
commands while the coder using the tape could not. Since the discrepancies
at the same time and at different times were of the same general order of
magnitude, it is likely that most of the observed difference across time
was due to the material on which coding was based rather than to differences
in subject reactivity. To study the impact of reactivity effects separately,
one might design such a study so that the same stimulus materials would be
used for coding.
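
The distinction drawn here, between a high correlation and agreement in
level, can be made concrete with a small sketch. The counts below are invented
for illustration and are not Bernal et al.'s data; the helper name pearson_r
is likewise an assumption. If an observer consistently codes more commands
than a tape coder (for example, because gestural commands are visible only to
the observer), the two records can correlate highly while still differing in
mean rate.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical command counts per session, coded from tape.
tape_counts = [4, 6, 10, 12, 8]
# Hypothetical observer counts: the same sessions plus two gestural
# commands per session that the tape coder cannot hear.
observer_counts = [count + 2 for count in tape_counts]

r = pearson_r(observer_counts, tape_counts)
mean_difference = (sum(observer_counts) - sum(tape_counts)) / len(tape_counts)
print(round(r, 2), mean_difference)
```

In this contrived case the correlation is perfect even though the observer
codes two more commands in every session, which is why a correlation alone
cannot separate reactivity from differences in the coding source.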

We are currently completing a study on reactivity which employs this
strategy to compare reactivity associated with an observer present in the
home carrying a tape recorder vs. the tape recorder alone. This study
involves six days of observation for [...] minutes per day with single-child
families. The two conditions are alternated so that the observer is present
one evening and not present the next. The observer is actually a "bogus"
observer. All behavioral coding is done on the basis of the tapes. It is
our suspicion that reactivity to the tape recorder will be short lived and
minimal compared to the reactivity associated with the observer present.

If these hypotheses are substantiated in this and other research,
alternatives to having an observer present in the home should be explored.
One solution to be seriously considered would be extended use of portable
video or audio tape recording equipment. These recording devices could
remain in the homes over an extended observation period to facilitate
habituation effects. In addition, the devices could be preprogrammed to turn
on and off at different times during the day so that the observed would not
know when they are in operation (as in Bernal et al., 1971). This solution,
which would, of course, require full knowledge and consent of the parties
involved, appears to be a promising one for attenuating reactivity effects
as well as solving problems of observer bias.

Factor 2: Individual Differences of the Subjects

Some people might be expected to manifest more reactivity to the presence
of an observer than others. A "personality" variable such as guardedness
might be correlated with degree of reactivity. For example, scores on the K
scale of the MMPI (or other comparable tests) might be related to the effects
of being observed in a natural setting.

The literature also suggests that age is correlated with reactivity.
Several authors (Barker & Wright, 1955; Polansky et al., 1949) have suggested
that younger children are less self-conscious and thereby less subject to
reactive effects than older children. The Martin et al. (1971) study also
suggests that sex might be an important factor accounting for different
levels of reactivity. Experiments are needed which compare these individual
difference variables in the natural setting with naturalistic observation
procedures.

Factor 3: Personal Attributes of the Observer

Evidence from semi-structured interviews suggests that reactive effects
may also be contributed by the unique attributes of the observer. Different
attributes of the observer may elicit different roles on the part of the
subject, depending upon what might be appropriate given the observer's
attribute. Rosenthal (1966) reports several such attributes that have been
demonstrated to yield differential effects, including the age of the observer,
sex, race, socio-economic class, and the observer's professional status (i.e.,
undergraduate observer vs. Ph.D. therapist). Martin et al. (1971) also
discovered that both the factors of age and sex of the observer had
differential effects on the subjects being observed. Varying any of these
dimensions parametrically would be relatively simple in investigating this
problem in the natural setting.

Factor 4: Rationale for Observation

Another factor that may be important in accounting for reactivity is
the amount of rationale given subjects for being observed. Whereas the
Bales (1950) study found no differential reactivity of three levels of
observer conspicuousness in a group-discussion setting, Smith (1957) found
that nonparticipant observers aroused hostility and uncertainty among
participating group members. Weick (1968) suggests that this discrepancy may
have been a function of different amounts of rationale for the presence of an
observer. We hypothesize that a thorough rationale for being observed might
be expected to reduce guardedness, anxiety, etc., and thereby reduce the
reactivity.

Observer reactivity is a problem that cannot be easily dismissed for
naturalistic observation. There is sufficient evidence to suggest that
observer reactivity can seriously limit the generalizability of naturalistic
observation data. Clearly, factors accounting for reactivity need to be
investigated and solutions derived to minimize the effects of the observer
on the observed. In the next section, we will describe how reactivity, in
addition to posing a problem for generalizability, can also interact with
and confound the dependent variable.

Observee Bias:
Demand Characteristics, Response Sets and Fakability

Reactivity to observation will always be a problem for naturalistic
research, but it would be a relatively manageable one if we could assume it
to be a relatively constant, noninteractive effect. That is, if we knew
that the presence of an observer reliably reduced activity level or deviant
behavior by 30%, for example, the problem would not be too damaging to
research investigations involving groups of subjects. But, what if the
observee's reactivity to being observed interacts with the dependent variable
under study?
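
The contrast between a constant and an interactive reactive effect can be put
in arithmetic terms. The sketch below is only an illustration under assumed
values: the 30% reduction is the example figure from the text, while the other
multipliers and the function names are hypothetical. A constant multiplier
leaves the proportional pre-to-post change intact, whereas a multiplier that
differs between assessments produces an apparent change even when the true
rate is unchanged.

```python
def observed_rate(true_rate, reactivity_multiplier):
    """Rate the observer records, given a multiplicative reactive effect."""
    return true_rate * reactivity_multiplier


def proportional_change(pre, post):
    """Change from pre to post, expressed as a proportion of pre."""
    return (post - pre) / pre


# Hypothetical true deviant-behavior rates: treatment halves the rate.
true_pre, true_post = 1.0, 0.5

# Constant reactivity: observation lowers the rate by 30% at both points.
constant_change = proportional_change(
    observed_rate(true_pre, 0.7), observed_rate(true_post, 0.7)
)

# Interactive reactivity (assumed multipliers): behavior is exaggerated
# before treatment and suppressed after it, with no true change at all.
spurious_change = proportional_change(
    observed_rate(1.0, 1.2), observed_rate(1.0, 0.7)
)

print(round(constant_change, 3), round(spurious_change, 3))
```

In the first case the observed change matches the true halving of the rate;
in the second, an apparent improvement appears although the true rate never
moved.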

Let us take the example of a treatment study on deviant children in which
observations are taken prior to and after treatment. Prior to treatment, the
appropriate thing for involved parents or teachers to do is to make their
referred child appear to be "deviant" in order to justify treatment. The
appropriate response at the end of treatment, on the other hand, is to make
the child appear improved in order to justify the termination, please the
therapist, etc. These are the demand characteristics of the situation. In
this case, the reactivity to being observed is not constant or unidirectional,
but interacts with and confounds the dependent variable. It is possible that
any improvement we see in the children's behavior is simply the result of
differential reactivity as a consequence of the demand characteristics of the
situation. Now, let us suppose we employ a wait list control group and
collect observational data twice before beginning treatment and at the same
interval as used for the treated group. This procedure provides an excellent
pretest-posttest control for our treated group. But, what of the demand
characteristics of this procedure? On the first assessment, the involved
parents or teachers will probably behave in the same general way as their
counterparts in the treated group, but by the second observation they may
be more desperate for help and even more concerned to present their child
as highly deviant. Thus, simply as a result of the demand characteristics
involved, we might expect our treatment group to show improvement while the
control group would show some deterioration.

We also may wish to compare our referred children with children who
are presumably "normal" or at least not referred for psychological treatment.
Once again, however, we might anticipate that parents recruited for
"normative" research on "typical" families would be more inclined than our
parents of referred children to present their wards as nondeviant or good. In
other words, a response set of social desirability could be operative with
this sample making them less directly comparable to the referred sample.

These arguments would, of course, be even more persuasive if we were
dealing with the observed behavior of the adults themselves. The foregoing
observations on children assume, however, that the involved adults are
capable of influencing children to appear relatively "deviant" or "normal"
if they wish to do so (i.e., that observational data on children are
potentially fakable by adult manipulation).

We have just completed a study (Johnson & Lobitz, 1972) which was
directed at testing this assumption. Twelve sets of parents with four- or
five-year-old children were instructed to do everything in their power to
make their children look "bad" or "deviant" on three days of a six-day home
observation and to make their children look "good" or "nondeviant" on the
remaining three days. Parents alternated from "good" to "bad" days in a
counterbalanced design.

Four predictions were made regarding the behavior of both children
and parents. During the "fake bad" periods, it was anticipated that,
relative to the "fake good" periods, there would be:

a) more deviant child behaviors,

b) a lower ratio of compliance to parental commands,

c) more "negative" responses on the part of parents, and

d) more parental commands.

Predictions a, c, and d were confirmed at or beyond the .01 level of
confidence. Only the child's compliance ratio failed to be responsive to
the manipulation. It will be recalled from the section on reliability that
this statistic is by far the least reliable and thus the least sensitive
(statistically) to manipulation. These results, which demonstrate the
fakability of naturalistic behavioral data, indicate that this kind of data
may potentially be confounded by demand characteristics and/or response sets.

We are aware of only one other study involving naturalistic observation
which helps demonstrate this problem (Horton, Larson, & Maser, 1972). This
study involved one teacher who was under the instruction of a "master"
teacher for the purpose of raising her classroom approval behavior. She was
observed, without her knowledge, by students in the class. The results
clearly showed that her approval behavior was at a much higher rate when
she was being observed by the "master" teacher than when she was not being
observed. Generalization from overtly observed periods to periods of
covert observation was very minimal indeed. More generalization was found
when the "master" teacher's presence in the classroom was put on a more
random schedule. This study is not completely analogous to most naturalistic
research because, in this case, the observer and trainer were the same person
and the study is limited in generalizability because of the N = 1 design.
Yet, in most cases, the observed are aware that the collected observational
data will be seen by the involved therapist, teacher, or researcher, and
if the problem exists for one subject, it is a potential problem for all
subjects. Observee bias is really a special case of subject reactivity to
observation. Thus, the potential solutions outlined in the previous section
apply here as well. In general, we suspect that observation procedures which
are relatively unobtrusive and which allow for relatively long periods of
adaptation will yield less reactivity and observee bias.



Validity of Naturalistic Behavioral Data

Just as behaviorists have ignored the requirement of classical
reliability in their data, they have also neglected to give any systematic
attention to the concept of validity. Most research investigations in the
behavior modification literature which have employed observational methods
have relied on behavior sampling in only one narrowly circumscribed
situation with no evidence that the observed behavior was representative of the
subject's action in other stimulus situations. In addition, behaviorists
have largely failed to show that the obtained scores on behavioral dimensions
bear any relationship to scores obtained on the same dimensions by different
measurement procedures. This fact calls into serious question the validity
of any of this research where the purpose has been to generalize beyond the
peculiar circumstances of the narrowly defined assessment situation. Of
course, the methodological problems we have presented thus far all pose
threats to the validity of the behavioral scores obtained. But, we would
argue that even if all these problems could somehow be magically solved, the
requirement for some form of convergent validity would still be essential.
As with reliability, there are many different methods of validation, but as
Campbell and Fiske (1959) point out:

Validation is typically convergent, a confirmation by independent
measurement procedures. Independence of methods is a common
denominator among the major types of validity (excepting content
validity) insofar as they are to be distinguished from reliability.
. . . Reliability is the agreement between two efforts to
measure the same trait through maximally similar methods. Validity
is represented in the agreement between two attempts to
measure the same trait through maximally different methods.

Thus, convergent validity is established when two dissimilar methods of
measuring the same variable yield similar or correlated results. Predictive
validity is established when the measure of a behavioral dimension correlates
with a criterion established by a dissimilar measurement instrument.

With only a few exceptions, behaviorists have restricted themselves to
face or content validity. And, of course, it must be admitted that the
face validity of narrowly-defined behavioral variables is often quite
persuasive. This is particularly true in cases where the behavioral dimension
under study has very narrow breadth or "band width." After all, a
behaviorist might argue, what can be a more valid measure of the rate of a
child's hitting in the classroom than a straightforward, accurate count of
that hitting? While this argument is persuasive, two counter arguments must be
considered. First, because of all of the methodological problems which we
have presented thus far, we can never be certain that the observed rates
during a limited observation period are completely valid or generalizable
even to very similar stimulus situations. While many of the problems we
have outlined can be solved and others attenuated, it is unlikely that all
will ever be completely eliminated. Second, is it not still of consequence
to know whether our behavior rate estimates have any relationship to other
important and logically related external variables? Is it not important,
for example, to know whether or not the teacher and classmates of an observed
high-rate hitter perceive this child as a hitter? It does seem important to
us, particularly for practical clinical purposes, since we know that people's
perceptions of others' behavior often have more to do with the way they
treat them than does the subject's actual behavior. The need for
establishing some form of convergent validation becomes even more profound as
the behavioral dimensions we deal with increase in band width. As we begin to
talk about such broad categories as appropriate vs. inappropriate behavior
(e.g., Gelfand, Gelfand, & Dobson, 1967), deviant vs. nondeviant behaviors
in children (e.g., Patterson, Ray, Shaw, 1969; Johnson et al., 1972), or
friendly vs. unfriendly behaviors (e.g., Raush, 1965), we are labeling
broader behavioral dimensions. At this level, we are dealing with constructs,
whether we like to admit it or not, and the importance of establishing the
validity of these constructs becomes crucial. In most cases, these broad
behavior categories have been made up of a collection of more discrete
behavior categories and, in general, the investigators involved have simply
divided behaviors into appropriate-inappropriate or deviant-nondeviant on a
purely a priori basis. While the categorizations often make a good deal of
sense (i.e., have face validity), this hardly seems a completely satisfactory
procedure for the development of a science of behavior.

We have had to face this problem in our own research, where we have
sought to combine the observed rates of certain coded behaviors and come up
with scores reflecting certain behavioral dimensions. The most central
dimension in this research has been the "total deviant behavior score" to
which we have repeatedly referred in this chapter. Let us outline here the
procedures we have used to explore the validity of this score. Although
we had a pretty good idea of which child behaviors would be viewed as "deviant"
or "bad" in this culture, we attempted to enhance the consensual face
validity of this score by asking parents of the "normal" children we observed
to rate the relative deviancy of each of the codes we use in our research.
Thus, in our sample of 33 families of four- and five-year-old children, we
asked each parent to read a simplified version of our coding manual and
characterize each behavior on a three-point scale from "clearly deviant" to
"clearly nondeviant and pleasing." We established an arbitrary cut-off score
and characterized any behavior above this cut-off as deviant. This resulted
in a list of 15 deviant behaviors out of a total of 35 codes. The second step
in validating this score and our implicit deviant-nondeviant dimension was
presented in a study by Adkins and Johnson (1972). We had already divided
our 35 codes into positive, negative, and neutral consequences. This
categorization was done on a purely a priori basis with a little help from the
data provided by Patterson and Cobb (1971) on the function of some of these
codes for eliciting and maintaining children's behavior. We reasoned that
behaviors which parents viewed as more deviant would receive relatively
more negative consequences than would behaviors viewed as less deviant. To
test this hypothesis, we simply rank ordered each behavior, first by the
mean parental verbal report score obtained and second by the mean proportion
of negative consequences the behavior received from family members. The
results of this procedure are presented in Table 1. Not all 35 behaviors are
included in this analysis, but the complex reasons for this outcome can more
parsimoniously be explained in a footnote. In any case, the Spearman
Rank Order Correlation between the two methods of characterizing behaviors
on the deviant-nondeviant dimension was .73. This was an encouraging finding,
but we noticed that the most dramatic exceptions to a more perfect agreement
between the two methods involved the reasonable command codes (command
positive and command negative). These codes are used when the child
reasonably asks someone to do something (positive command) or not to do
something (negative command). Naturally, most parents felt that these
innocuous responses were nondeviant. But, behaviorally, people don't always
do what they are asked to by a four- or five-year-old child, and since
noncompliance was coded as a negative consequence, it seemed that this
artifact of our characterization might have artificially lowered this
coefficient. When these two command categories were eliminated from the
calculation, the correlation coefficient rose to .81.

Insert Table 1 about here
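
The rank-order comparison just described can be retraced with a short sketch. The ranks below are transcribed from Table 1 as best they can be read (several cells of the available copy are hard to make out, so the lists should be treated as an assumption). The sketch also assumes the classical sum-of-squared-differences formula, and that the row labelled "Command" is the positive command code. `average_ranks` and `spearman_rho` are illustrative helper names, not part of the original analysis.

```python
# (behavior, rank by parental rating, rank by proportion of negative
# consequences). Assumed transcription of the two rank columns of Table 1.
TABLE_1 = [
    ("Whine", 1, 13), ("Physical Negative", 2, 2), ("Destructive", 4, 8),
    ("Tease", 4, 5), ("Smart Talk", 4, 4), ("Aversive Command", 6, 3),
    ("Noncompliance", 7, 12), ("Rate", 8, 16), ("Ignore", 9, 11),
    ("Yell", 10, 10), ("Demand Attention", 11, 15), ("Negativism", 12, 6),
    ("Command Negative", 13, 1), ("Disapproval", 14, 9), ("Cry", 15, 14),
    ("Indulgence", 16, 22), ("Command Prime", 17, 27.5), ("Receive", 18, 18),
    ("Talk", 19, 23), ("Command", 20, 7), ("Attention", 21, 25),
    ("Touch", 22, 20), ("Independent Activity", 23, 26),
    ("Physical Positive", 24, 21), ("Comply", 25, 17), ("Laugh", 26, 19),
    ("Nonverbal Interaction", 27, 24), ("Approval", 28, 27.5),
]


def average_ranks(values):
    """Rank values 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Classical formula: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d_squared / (n * (n * n - 1))


def rho_for(rows):
    return spearman_rho([r[1] for r in rows], [r[2] for r in rows])


rho_all = rho_for(TABLE_1)
# Drop the two reasonable-command codes; the rest are re-ranked internally.
rho_without_commands = rho_for(
    [r for r in TABLE_1 if r[0] not in ("Command", "Command Negative")]
)
print(round(rho_all, 2), round(rho_without_commands, 2))
```

Under these assumptions the two coefficients round to .73 and .81, in line with the values reported above.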

The third piece of evidence for the validity of the deviant behavior
score comes from the Johnson and Lobitz (1972) study already reviewed in
the previous section. In this study, parents were asked to make their
children look "good" and "nondeviant" for half of the observations and "bad"
or "deviant" on the other half. They were not told how to accomplish this, nor
were they told what behaviors were considered "bad" or "deviant." The fact
that the deviant behavior score was consistently and significantly higher
on the "bad" days provides further evidence for the construct validity of the
score.

While evidence for the convergent or predictive validity of behavioral
data is difficult to find in the literature, there are some encouraging
exceptions to this general lack of data. Patterson and Reid (1971), for
example, found an average correlation of .63 (p < .05) between parents'
observations of their children's low rate referral symptoms on a given day and
the trained observer's tally of targeted deviant behaviors on that day.
Several studies have found significant relationships between behavioral
ratings of children in the classroom and academic achievement (Meyers, Attwell,
& Orpet, 1968; D'Heurle, Mellinger, & Haggard, 1959; Hughes, 1968). The
data base of these studies is somewhat different from that currently employed
by most behaviorists because they involve ratings by observers on relatively
broad dimensions, as opposed to behavior rate counts. For example, dimensions
used in these studies included "coping strength," defined as ability to attend
to reading tests while being subjected to delayed auditory feedback (Hughes,
1968), or "persistence," defined as ". . . uses time constructively and to
good purpose; stays with work until finished" (D'Heurle, Mellinger, & Haggard,
1959). Nevertheless, these studies demonstrate the potential for behavior
observation data to provide evidence of predictive validity. Two other
studies (Cobb, 1969; Lahaderne, 1968) yield similar predictive validity
findings based on behavioral rate data. Lahaderne (1968) found that attending
behavior, as observed over a two-month period, provided correlations ranging
from .39 to .51 with various standard tests of achievement. Even with
intelligence level controlled, significant correlations between attentive
behavior and achievement were found. Cobb (1969) obtained similar results in
correlating various behavior rate scores with arithmetic achievement, but
found no significant relationship between these behavior scores and
achievement in spelling and reading. These predictive validity studies are
very important to the development of the field as they suggest that
manipulation of these behavioral variables may well result in productive
changes in academic achievement.

In our own laboratory, we are exploring the convergent validity of
naturalistic behavioral data by relating it to measures on similar dimensions
in the laboratory, which include: a) parent and child interaction behavior in
standard stimulus situations similar to those employed by Wahler (1967) and
Johnson and Brown (1969), b) parent behavior in response to standard
stimulus audio tapes similar in design to those used by Rothbart and Maccoby
(1966) and parent behavior in standardized tasks similar to those used by
Berberich (1970), and c) parent attitude and behavior rating measures on
their children. Unfortunately, at this writing, most of this data has not
been completely analyzed, but an overall report of this research will be
forthcoming. A recent dissertation by Martin (1971), however, was devoted to
studying the relationships between parent behavior in the home and parent
behavior in analogue situations. By and large, the results of this research
indicated no systematic relationships between the two measures. The same
general findings for parents' responses to deviant and nondeviant behavior
were replicated in the naturalistic and the analogue data, but correlations
relating individual parental behavior in one setting with that in the other
were generally nonsignificant. We don't know, of course, which, if either,
of the measures represents "truth," but this study underlines the importance
of seriously questioning the assumptions usually made in any analogue or
modified naturalistic research. As Martin (1971) points out, these negative
results are very representative of findings in other investigations where
naturalistic behavior data has been compared to data collected in more
artificial analogue conditions (e.g., see Pawl, 1963; Gump & Kounin, 1960;
Chapanis, 1967).

Before closing this section on validity, we would like to briefly
take note of the efforts of Cronbach and his associates to reconceptualize
the issues of observer agreement, reliability, and validity as parts of the
broader concept of generalizability. A full elaboration of generalizability
theory goes far beyond the purposes of this chapter and the interested
reader may be referred to several primary and secondary sources for a more
complete presentation of this model (e.g., Cronbach, Rajaratnam, & Gleser,
1963; Rajaratnam, Cronbach, & Gleser, 1965; Gleser, Cronbach, & Rajaratnam,
1965; Wiggins, 1972). According to this generalizability view, the concerns
of observer agreement, reliability, and validity all boil down to a concern
for the extent to which an obtained score is generalizable to the "universe"
to which the researcher wishes the score to apply. Once an investigator
is able to specify this "universe," he should be able to specify and test
the relevant sources of possible threat to generalizability. In a typical
naturalistic observational study, for example, we would usually at least
want to know the generalizability of data across a) observers, b) occasions
in the same setting, and c) settings. Through the generalizability model,
each of these sources of variance could be explored in a factorial design and
their contribution analyzed within an analysis-of-variance model. This model
is particularly appealing because it provides for simultaneous assessment of
the extent of various sources of "error" which could limit generalizability.
In spite of the advantages of this factorial model, there are few precedents
for its use. This is probably more the result of practical problems than of
resistance to this intellectually appealing and theoretically sound
model. Even if one were to restrict himself to the three sources of variance
outlined above, the resulting generalizability study would, for most useful
purposes, be a formidable project indeed. Projects of this kind appear to
us, however, to be well worth doing and we can probably expect to see more
investigations which employ this generalizability model.
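
To make the analysis-of-variance logic concrete, here is a minimal sketch for the simplest case only: subjects fully crossed with observers, one score per cell. The fuller design discussed above (observers, occasions, and settings) would need additional terms. The scores are hypothetical, and `g_study` and `g_coefficient` are illustrative names rather than an established routine.

```python
def g_study(scores):
    """Variance components for a subjects x observers crossed design.

    scores[s][o] is the score given to subject s by observer o.
    """
    n_s, n_o = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n_s * n_o)
    subj_means = [sum(row) / n_o for row in scores]
    obs_means = [sum(scores[s][o] for s in range(n_s)) / n_s
                 for o in range(n_o)]

    ss_subj = n_o * sum((m - grand) ** 2 for m in subj_means)
    ss_obs = n_s * sum((m - grand) ** 2 for m in obs_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_resid = ss_total - ss_subj - ss_obs

    ms_subj = ss_subj / (n_s - 1)
    ms_obs = ss_obs / (n_o - 1)
    ms_resid = ss_resid / ((n_s - 1) * (n_o - 1))

    # Expected-mean-square solutions; negative estimates are set to zero.
    return {
        "subject": max((ms_subj - ms_resid) / n_o, 0.0),
        "observer": max((ms_obs - ms_resid) / n_s, 0.0),
        "residual": ms_resid,
    }


def g_coefficient(components, n_observers):
    """Relative generalizability of a mean over n_observers observers."""
    subject, residual = components["subject"], components["residual"]
    return subject / (subject + residual / n_observers)


# Hypothetical deviant behavior scores: 4 children, 2 observers each.
scores = [[10, 12], [20, 21], [5, 7], [15, 14]]
components = g_study(scores)
print(components, g_coefficient(components, n_observers=1))
```

In this made-up example nearly all of the variance lies between children, so a single observer's score would generalize well across observers. Real data would of course have to answer that question for themselves.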
It should be pointed out that the generalizability study
outlined above does not really speak to the traditional validity requirement
as succinctly defined by Campbell and Fiske (1959): "Validity is represented
in the agreement between two attempts to measure the same trait through
maximally different methods." As stated earlier, to fulfill this
requirement, one must provide evidence of some form of convergent validity by
the use of methods other than direct behavioral observation. The
generalizability model can, theoretically, handle any factor of this type
under the heading of methods or "conditions," but the analysis-of-variance
model employed requires a factorial design. Thus, it would seem extremely
difficult and sometimes impossible to integrate factorially other methods of
testing or rating in a design which encompassed the three variables outlined
above: observers, occasions, and settings. As a result of these
considerations, we question the extent to which one generalizability study,
at least in this area of research, can fulfill all the requirements of
observer agreement, validity, and reliability which we view as so important.
Rather, it is likely that multiple analyses will still be necessary to
sufficiently establish all of the methodological requirements we have
outlined for naturalistic observational data. These multiple analyses may, of
course, involve analyses of variance in a generalizability model or
correlational analyses as traditionally employed.

Krantz (1971) points out that the basic controversy over group vs.
individual subject designs has contributed largely to the development of the
mutual isolation of operant and nonoperant psychology. Since the measurement
of reliability and convergent validity is typically based on correlations
across a group of subjects, the operant psychologist may feel that these are
alien concepts which have no relevance for his research. We would dispute
this view on the following logical grounds. Reliability involves the
requirement for consistency in measurement, and without some minimal level of
such consistency, there can be no demonstration of functional relationships
between the dependent variable and the independent variable. Efforts are
currently underway to discover statistical procedures for establishing
reliability estimates for the single case (e.g., see Jones, 1972). Any operant
study which involves repeating manipulative procedures on more than one subject
can be used for reliability assessment by traditional methods. Once such
reliability is established, either for the individual case or for a group, we
can be much more confident in the data and its meaning. Validity involves
the requirement of convergence among different methods in measuring the same
behavioral dimension. Where the validity of a measurement procedure has
been previously established for a group, we can use it with more confidence
in each individual case. Where it has not, it is still possible to explore
for convergence in a single case. We can simply see, for example, if the
child who shows high rates of aggressive behavior is perceived as aggressive
by significant others. This procedure may be done with some precision if
normative data is available on the measures used in the single case. Thus,
with normative data available, one can explore the position of the single case
on the distribution of each measurement instrument. One could see, for
example, if the child who is perceived to be among the top 5% in
aggressiveness actually shows aggressive behavior at a rate higher than 95% of
his peers. The requirements of reliability and validity are logically sound
ones which transcend experimental method and means of calculation.
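
The single-case check described above can be sketched as follows. All numbers are hypothetical, `percentile_rank` is an illustrative helper, and the 95th-percentile cut-off simply mirrors the "top 5%" example in the text.

```python
def percentile_rank(value, norms):
    """Percent of the normative sample scoring strictly below `value`."""
    below = sum(1 for x in norms if x < value)
    return 100.0 * below / len(norms)


# Hypothetical norms for 40 peers on each instrument.
rate_norms = list(range(1, 41))  # observed aggressive acts per session
rating_norms = [1] * 10 + [2] * 14 + [3] * 10 + [4] * 4 + [5] * 2  # 1-5 scale

child_rate, child_rating = 39, 5  # the single case under study

rate_standing = percentile_rank(child_rate, rate_norms)
rating_standing = percentile_rank(child_rating, rating_norms)

# Convergence in this sketch: both methods place the child in the top 5%.
converges = rate_standing >= 95 and rating_standing >= 95
print(rate_standing, rating_standing, converges)
```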

These methodological issues, like all others presented in this chapter,
are highly relevant for behavioral research, even though they may at first
seem alien to it as the products of rival schools of thought. It has been our
argument that the requirements of sound methodology transcend "schools" and
that the time has come for us to attend to any variables which threaten the
quality, generalizability, or meaningfulness of our data. Behavioral
data is the most central commonality and critical contribution of all
behavior modification research. The behaviorists' contribution to the science
of human behavior and to solutions of human problems will largely rest on
the quality of this data base.
                                  Table 1

                 Coded Behaviors as Ranked by Two Methods:
            Parental Ratings and Negative Social Consequences

Rank by                            Rank by         Mean Parent   Proportion of
Parental                           Proportion of   Rating for    Negative
Rating    Behavior                 Negative        Behavior      Consequences
                                   Consequences                  to Behavior
-------------------------------------------------------------------------------
  1       Whine                      13            [illegible]   [illegible]
  2       Physical Negative           2            1.074         .527
  4       Destructive                 8            1.204         .352
  4       Tease                       5            1.204         .382
  4       Smart Talk                  4            1.204         [illegible]
  6       Aversive Command            3            1.208         [illegible]
  7       Noncompliance              12            1.278         [illegible]
  8       Rate                       16            1.307         [illegible]
  9       Ignore                     11            1.370         [illegible]
 10       Yell                       10            [illegible]   [illegible]
 11       Demand Attention           15            1.611         .083
 12       Negativism                  6            1.685         .375
 13       Command Negative            1            [illegible]   [illegible]
 14       Disapproval                 9            [illegible]   .235
 15       Cry                        14            [illegible]   [illegible]
 16       Indulgence                 22            2.003         [illegible]
 17       Command Prime              27.5          2.132         .000
 18       Receive                    18            2.222         [illegible]
 19       Talk                       23            2.278         [illegible]
 20       Command                     7            2.296         .355
 21       Attention                  25            [illegible]   .013
 22       Touch                      20            2.648         .043
 23       Independent Activity       26            2.704         .005
 24       Physical Positive          21            2.741         .034
 25       Comply                     17            2.759         .053
 26       Laugh                      19            2.778         .044
 27       Nonverbal Interaction      24            2.833         .012
 28       Approval                   27.5          2.926         .000
-------------------------------------------------------------------------------
Note. Spearman rank-order correlation between the two rank orderings = .73
(p < .01).

METHODOLOGICAL ISSUES IN NATURALISTIC OBSERVATION:
SOME PROBLEMS AND SOLUTIONS FOR FIELD RESEARCH
Stephen M. Johnson and Orin D. Bolstad
University of Oregon

Encapsulated schools of thought have occurred in all sciences at
some stage in their development. They appear most frequently during
periods where the fundamental assumptions of the science are in question.
Manifesto papers, acrimonious controversy, mutual rejection, and
isolation of other schools' strategies are hallmarks of such episodes [David
L. Krantz, The separate worlds of operant and non-operant psychology.
Journal of Applied Behavior Analysis, 1971, 4 (1), p. 61].

History may well reveal that the greatest contribution of behavior
modification to the treatment of human problems came with its emphasis on
the collection of behavioral data in natural settings. The growth of the
field will surely continue to produce greater refinement and proliferation
of specific behavior change procedures, but the critical standard for
assessing their utility will very likely remain the same. We will always
want to know how a given procedure affects the subject's relevant behavior
in his "real" world.

If a behaviorist wants to convince someone of the correctness of his
approach to treating human problems, he is generally much less likely to
rely on logic, authority, or personal testimonials to persuade than are
proponents of other schools of psychotherapeutic thought. Rather, it is
most likely that he will show his behavioral data with the intimation that this
data speaks eloquently for itself. Because he is aware of the research on
the low level of generalizability of behavior across settings (e.g., see
Mischel, 1968), he is likely to be more confident in this data as it
becomes more naturalistic in character (i.e., as it reflects naturally
occurring behavior in the subject's usual habitat). As a perusal of the
behavior modification literature will indicate, these data are often
extremely persuasive. Yet, the apparent success of behavior modification
and the enthusiasm that this success breeds may cause all of us to take
an uncritical approach in evaluating the quality of that data on which the
claims of success are based. A critical review of the naturalistic data
in behavior modification research will reveal that most of it is gathered
under circumstances in which a host of confounding influences can
operate to yield invalid results. The observers employed are usually aware
of the nature, purpose, and expected results of the observation. The
observed are also usually aware of being watched and often they also know
the purpose and expected outcome of the observation. The procedures for
gathering and computing data on observer agreement or accuracy are often
inappropriate or irrelevant to the purposes of the investigation. There is
almost never an indication of the reliability of the dependent variable under
study, and rarely is there any systematic data on the convergent validity of
the dependent measure(s). Thus, by the standards employed in some other areas
of psychological research, it can be charged that much behavior modification
research data is subject to observer bias, observee reactivity, fakability,
demand characteristics, response sets, and decay in instrumentation. In
addition, the accuracy, reliability, and validity of the data used is often
unknown or inadequately established.

But the purpose of this paper is not to catalogue our mistakes or to
argue for the rejection of all but the purest data. If that were the case,
we would probably have to conclude with that depressing note which makes
so many treatises on methodology so discouraging. Although dressed in more
technical language, this purist view often expresses itself as: "You
can't get there from here." We can get there, but it's not quite as
simple as perhaps we were first led to believe. The first step in getting
there is to define and describe those factors which most often jeopardize
the validity of naturalistic behavioral data. To this end, we will review
a host of investigations from many laboratories which demonstrate these
methodological problems. The second step is more constructive in nature:
to suggest, implement, and test the effectiveness of various solutions to
these dilemmas of methodology. Because behavioral data has become the
primary basis for our approach to diagnosing and treating human problems,
the endeavor to improve methodology is perhaps our most critical task for
strengthening our contribution to the science of human behavior.

We will argue that the same kinds of methodological considerations
which are relevant in other areas of psychology are equally pertinent for
behavioral research. At least with respect to the requirements of sound
methodology, the time of isolation of behavioral psychology from other
areas of the discipline should quickly come to an end.

Throughout this paper, we will rely heavily on the experience of our
own research group in meeting, or at least attenuating, these problems.
We take this approach to illustrate the problems and their possible
solutions more precisely and concretely. Most of our solutions are far from
perfect or final, but it is our hope that a report based on real
experience and data may be more meaningful than hypothetical solutions which
remain untested. Thus, before beginning on the outline of methodological
problems and their respective solutions, it will be necessary for the
reader to have a general understanding of the purposes and procedures of
our research. This research involves the observation of both "normal"
and "deviant" children and families in the home setting. The observation
system employed is a modified form of the code devised by Patterson, Ray,
Shaw, and Cobb (1969). This revised system utilizes 35 distinct behavior
categories to record all of the behaviors of the target child and all behaviors
of other family members as they interact with this child. The system is
designed for rapid sequential recording of the child's behavior, the
responses of family members, the child's ensuing response, etc. Observations
are typically done for forty-five minutes per evening during the pre-dinner
hour for five consecutive week nights. The observations are made under
certain restrictive conditions: a) All family members must be present in
two adjoining rooms; b) No interactions with the observer are permitted;
c) The television set may not be on; and d) No visitors or extended
telephone calls are permitted. Obviously, this represents a modified
naturalistic situation.

On the average, these procedures yield the recording of between 1,800
and 1,900 responses and an approximately equal number of responses of other
family agents over this time period of 3 hours and 45 minutes. This data is
collected in connection with a number of interrelated projects. These include
normative research investigations of the "normal" child (e.g., Johnson, Wahl,
Martin, & Johansson, 1972); research involving a behavioral analysis of the
child and his family (e.g., Wahl, Johnson, Martin, & Johansson, 1972;
Karpowitz, 1972; Johansson, Johnson, Martin, & Wahl, 1971); outcome research on
the effects of behavior modification intervention in families (Eyberg,
1972); comparisons of "normal" and "deviant" child populations (Lobitz &
Johnson, 1972); and studies of methodological problems (Johnson & Lobitz, 1972;
Adkins & Johnson, 1972; Martin, 1971). These latter studies will be
reviewed in detail in the body of this paper. More recently, we have begun
to investigate the generality of children's behavior across school and home
settings, and to document the level of generalization of the effects of
behavior modification in one setting to behavior in other settings (Walker,
Johnson, & Hops, 1972). Research is also in progress to relate naturalistic
behavioral data to parental attitudes and behavioral data obtained in more
artificial laboratory settings. With all of these objectives in mind, it
is most critical that the behavioral data collected is as valid as
possible and it is to this end that we explore the complex problems of
methodology presented here.

Observer Agreement and Accuracy I:
Problems of Calculation and Inference
The most widely recognized requirement of research involving behavioral
observations is the establishment of the accuracy of the observers. This is
typically done by some form of calculation of agreement between two or more
observers in the field. Occasionally, observers are tested for accuracy by
comparing their coding of video or audio tape with some previously established
criterion coding of the recorded behavior. For convenience, we will refer to
the former procedure as calculation of observer agreement and the latter as
calculation of observer accuracy. In general, both of these procedures have
been labeled observer reliability. We will eschew this terminology because it
tends to confuse this simple requirement for observer agreement or accuracy
with the concept of the reliability of a test as understood in traditional
test theory. As we shall outline in section three, it is quite possible
to have perfect observer agreement or accuracy on a given behavioral
score with absolutely no reliability or consistency of measurement in the
traditional sense. Generally, the classic reliability requirement involves
a demand for consistency in the measurement instrument over time (e.g.,
test-retest reliability) or over sampled item sets responded to at roughly
the same time (e.g., split-half reliability). An example may help clarify
this point. If two computers score the same MMPI protocol identically,
there is perfect "observer agreement," but this in no way means that the
MMPI is a reliable test which yields consistent scores. Although the
question of reliability as traditionally understood has been largely ignored
in behavioral research, we will argue in section three that it is a critical
methodological requirement which should be clearly distinguished from
observer agreement and accuracy.

There is no one established way to assess observer agreement or
accuracy and that is as it should be, because the index must be tailored
to suit the purposes of each individual investigation. There are three
basic decisions which must be made in calculating observer agreement. The
first decision involves the stipulation of the unit score on which the index
of agreement should be assessed. In other words, what is the dependent
variable for which an index of accuracy is required as measured by
agreement with other observers or with a criterion? An example from our own
research may help clarify this point. We obtain a "total deviant behavior
score" for each of the children we observe. This score is based on the sum
output of 15 behaviors judged to be deviant in nature. An outline of the
rationale and validity of this score will be given in a later section.
Suffice it to say, whenever two observers watch the same child for a given
period, they each come up with their own deviant behavior score. These
scores may then be compared for agreement on overall frequency. It is
obvious that the same deviant behaviors need not be observed to get high
indexes of agreement on the total number of deviant behaviors observed.
Yet, for many of our purposes, this is not important, since we merely want
an index of the overall output of deviant behavior over a given period.
The same procedure is, of course, applicable to one behavior only, chains
of behavior, etc. The point is that the researcher must decide what unit
is of interest to him for his purposes and then compare agreement data on
that variable. In complex coding systems, like the one used in our
laboratory, it has been customary to get an overall percent agreement figure
which reflects the average level of agreement within small time blocks
(e.g., 6-10 seconds) over all codes. In general, we would argue that this
kind of observer agreement data is relatively meaningless. It has limited
meaning because it is based on a combination of codes, some of which are
observed with high consensus and some which are not. Furthermore, the figure
tends to overweight those high rate behaviors which are usually observed
with greater accuracy and underweight those low frequency behaviors which
are usually observed with less accuracy. Patterson (personal communication)
has reported that the observer agreement on a code correlates .49 with
its frequency of use. Since it is often the low base rate behaviors which
are of most interest to researchers, this overall index of observer
agreement probably overestimates the actual agreement on those variables of
most concern.
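
The overweighting of high rate codes can be seen in a small sketch. The two coding records are hypothetical, as are the helper names `pooled_agreement` and `per_code_agreement`. Each list holds one code per time block, and per-code agreement is taken over blocks in which at least one observer used the code.

```python
def pooled_agreement(a, b):
    """Proportion of time blocks in which both observers used the same code."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def per_code_agreement(a, b):
    """For each code: blocks where both used it / blocks where either did."""
    result = {}
    for code in sorted(set(a) | set(b)):
        either = sum(1 for x, y in zip(a, b) if code in (x, y))
        both = sum(1 for x, y in zip(a, b) if x == y == code)
        result[code] = both / either
    return result


# Hypothetical records: 20 blocks, one frequent code and one rare code.
observer_a = ["Talk"] * 18 + ["Whine", "Whine"]
observer_b = ["Talk"] * 18 + ["Talk", "Whine"]

overall = pooled_agreement(observer_a, observer_b)
by_code = per_code_agreement(observer_a, observer_b)
print(overall, by_code)
```

Here the pooled figure is 95%, yet agreement on the rare code is only 50%. That gap is what the overall index hides.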

The second question to be faced involves the time span within which
common coding is to be counted as an agreement. For most purposes of our
current research, score agreement over the entire 225 minutes of
observation is adequate. Thus, when we compute the total deviant behavior score
over this period, we do not know that each observer sees the same deviant
behavior at the same time. But good agreement on the overall score tells
us that we have a consensually validated estimate of the child's overall
deviancy. For some research purposes, this broad time span for agreement
would be totally inadequate. For conditional probability analysis of one
behavior (cf. Patterson & Cobb, 1971), for example, one needs to know
that two observers saw the same behavior at the same time and (depending
on the question) that each observer also saw the same set or chain of
antecedents and/or consequences. This latter criterion is extremely
stringent, particularly with complex codes where low rate behaviors are
involved, but these criteria are necessary for an appropriate accuracy
estimate.

Once one has decided on the score to be analyzed and the temporal
rules for obtaining this score, one must then face the problem of what to do
with these scores to give a numerical index of agreement. The two most
common methods of analysis are percent agreement and some form of
correlational analysis over the two sets of values. Both methods may, of
course, be used for observer agreement calculation within one subject or
across a group of subjects. Once again, neither method is always appropriate
for every problem and each has its advantages and disadvantages. The most
common way of calculating observer agreement involves the following simple
formula:

            number of agreements
    ------------------------------------
    number of agreements + disagreements

What is defined as an agreement or disagreement has already been solved if
one has decided on the "score" to be calibrated and the time span involved.

Use of this formula implies, however, that one must be able to
discriminate the occurrence of both agreements and disagreements. This can
only be accomplished precisely when the time span covered is relatively
small (e.g., 1-15 seconds) so that one can be reasonably sure that two
observers agreed or disagreed on the same coding unit. It has been common
practice for investigators to compare recorded occurrences of behavior
units over much longer time periods and obtain a percent agreement figure
between two observers which reflects the following:

    smaller number of observed occurrences
    --------------------------------------
    larger number of observed occurrences

The present authors would view this as an inappropriate procedure because
there is no necessary "agreement" implied by the resulting percent. If
one observer sees 10 occurrences of a behavior over a 30-minute period and
the other sees 12, there is no assurance that they were ever in agreement.
The behavior could have occurred 22 or more times and there could be
absolutely no agreement on specific events. The two observers did not
necessarily agree 83% of the time. Data of this kind can be more appropriately
analyzed by correlational methods if such analysis is consistent with the way
in which the data is employed for the question under study. Although the same
basic problem mentioned above can, of course, occur, the correlational
method is viewed as more appropriate because: a) The correlation is computed
over an array of subjects or observation time segments and b) The correlation
reflects the level of agreement on the total obtained score and it does not
imply any agreement on specific events.
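
The difference between the two indexes can be illustrated with the 10-versus-12 example. The sketch below assumes, hypothetically, that each observer's record is the set of short time blocks in which the behavior was coded. `frequency_ratio` and `interval_agreement` are illustrative names.

```python
def frequency_ratio(a_blocks, b_blocks):
    """Smaller tally divided by larger tally (the ratio questioned above)."""
    small, large = sorted((len(a_blocks), len(b_blocks)))
    return small / large if large else 1.0


def interval_agreement(a_blocks, b_blocks):
    """Agreements / (agreements + disagreements) over specific time blocks."""
    a, b = set(a_blocks), set(b_blocks)
    agreements = len(a & b)      # blocks coded by both observers
    disagreements = len(a ^ b)   # blocks coded by exactly one observer
    total = agreements + disagreements
    return agreements / total if total else 1.0


# Hypothetical 30-minute session: the two tallies never coincide in time.
observer_a = set(range(0, 10))      # 10 occurrences, early in the session
observer_b = set(range(200, 212))   # 12 occurrences, late in the session

ratio = frequency_ratio(observer_a, observer_b)
agreement = interval_agreement(observer_a, observer_b)
print(round(ratio, 2), agreement)
```

The tally ratio comes out at about .83 even though the observers never agreed on a single event, so the block-by-block index is zero.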

Whenever using the appropriate method of calculating observer agreement
percent (i.e., $\frac{\text{number of agreements}}{\text{agreements} + \text{disagreements}}$), the investigator
should be particularly cognizant of the base rate problem. That is, the
obtained percent agreement figure should be compared with the amount of
agreement that could be obtained by chance. An example will clarify this
point. Suppose two coders are coding on a binary behavior coding system
(e.g., appropriate vs. inappropriate behavior). For the sake of
illustration, let us suppose that observers have to characterize the
subject's behavior as either appropriate or inappropriate every five
seconds. Now, let us suppose, as is usually the case, that most of the
subject's behavior is appropriate. If the subject's behavior were
appropriate 90% of the time, two observers coding randomly at these base
rates (i.e., .90-.10) will obtain 82% agreement by chance alone. Chance
agreement is computed by squaring the base rate of each code category and
summing these values.^3 In this simple case, the mathematics would be as
follows: $.90^2 + .10^2 = .82$. The same procedure may, of course, be
used with multi-code systems.
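
The chance-agreement computation described above can be sketched in a few lines of Python. The function name `chance_agreement` and the three-code base rates are hypothetical, chosen only to show that the same sum of squares extends to a multi-code system.

```python
def chance_agreement(base_rates):
    """Expected agreement for two observers coding at random with the
    given base rates: the sum of the squared category proportions."""
    if abs(sum(base_rates) - 1.0) > 1e-9:
        raise ValueError("base rates must sum to 1")
    return sum(p * p for p in base_rates)


# Binary example from the text: appropriate .90, inappropriate .10
binary = chance_agreement([0.90, 0.10])

# Hypothetical three-code system (illustrative base rates only)
multi = chance_agreement([0.50, 0.30, 0.20])

print(round(binary, 2), round(multi, 2))
```

An obtained percent agreement can then be read against this figure rather than against zero.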

The above .90-.10 split problem may be reconceptualized as one in which
the occurrence or nonoccurrence of inappropriate behavior is coded every five
seconds. If, for purposes of computing observer agreement, we look at only
those blocks in which at least one of two observers coded the occurrence
of inappropriate behavior, the chance level agreement is drastically reduced.
The probability that two observers would code occurrence in the same block by
chance is only $.10^2$ or one percent. It would not be theoretically
inappropriate to count agreement on nonoccurrence but, in the present example
and in most cases, this procedure is associated with relatively high levels of
chance agreement.
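
A minimal Python sketch of the two scoring conventions follows. The function names and the 20-block records are assumptions made for illustration; the records are invented, with one observer coding occurrence in blocks 2 and 5 and the other in blocks 5 and 9.

```python
def overall_agreement(a, b):
    """Agreements / (agreements + disagreements) over all blocks,
    counting joint nonoccurrence as an agreement."""
    if len(a) != len(b) or not a:
        raise ValueError("records must be non-empty and of equal length")
    agree = sum(1 for x, y in zip(a, b) if x == y)
    return agree / len(a)


def occurrence_agreement(a, b):
    """The same ratio restricted to blocks in which at least one
    observer coded an occurrence (1)."""
    agree = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    disagree = sum(1 for x, y in zip(a, b) if x != y)
    if agree + disagree == 0:
        return None  # no occurrence coded by either observer
    return agree / (agree + disagree)


# Invented records: 20 five-second blocks, 1 = inappropriate behavior coded
obs_a = [1 if i in (2, 5) else 0 for i in range(20)]
obs_b = [1 if i in (5, 9) else 0 for i in range(20)]

print(overall_agreement(obs_a, obs_b), occurrence_agreement(obs_a, obs_b))
```

With a rare behavior, the overall figure is dominated by joint nonoccurrence, while the occurrence-only figure reflects how often the observers actually coincide on the behavior of interest.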

Whenever percent agreement data is reported, the base rate chance
agreement should also be reported and the difference noted. Statistical
tests of that difference can, of course, be computed. As long as the base
rate data is reported, the percent agreement figure would always seem to be
appropriate. For obvious reasons, however, it becomes less satisfactory as
the chance agreement figure approaches 1.0.

The other common method of computing agreement data is by means of a
correlation between two sets of observations. The values may be scores from
a group of subjects or scores from n observation segments on one subject.
This method is particularly useful when one is faced with the high chance
agreement problem or where the requirement of simple similarity in ordering
subjects on the dependent variable is sufficient for the research. As we
shall illustrate, the correlation is also particularly useful in cases where
one has a limited sample of observer agreement data relative to the total
amount of observation data. In general, correlations have been used with
data scores based on relatively large time samples. In other words, they
tend to be used for summary scores on individuals over periods of 10 minutes
to 24 hours. There is no reason why correlation methodology could not be
applied to data from smaller time segments (e.g., 5 seconds), but this has
rarely been done. So, studies using correlation methods have generally been
those in which one cannot be sure that the same behaviors are being jointly
observed at the same time. In using correlation methods for estimating
agreement, one should be aware of two phenomena. First, it is possible to
obtain high coefficients of correlation when one observer consistently
overestimates behavioral rates relative to the other observer. This
difference can be rather large, but if it is consistently in one direction,
the correlation can be quite high. For some purposes this problem would be
of little consequence but for other purposes it could be of considerable
importance. The data can be examined visually, or in other more systematic
ways, to see to what extent this is the case. This problem can be virtually
eliminated if one uses many observers and arranges for all of them to
calibrate each other for agreement data. Under these circumstances, one will
obtain a collection of regular observer figures and a list of mixed
calibrator figures for correlation. This procedure should generally correct
for systematic individual differences and make a consistent pattern as
outlined above extremely unlikely. The second problem to be cognizant of in using
correlations is that higher values become more possible as the range on 
the dependent variable becomes greater. This fact may lead to high indexes 
of agreement when observers are really quite discrepant with respect to 
the number of a given behavior they are observing. An illustration may 
clarify this point. Let us suppose we are observing rates of crying and 
whining behavior in preschool children over a five-hour period. Some 
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particularly "good" children may display these behaviorr; vory litLlo and, 
given a true occurrence score of 7, two observers may obtain scores of 5 
and 10 on this behavior class. This would be only 50% agreement. Other 
children display these behaviors with moderate to very hif:n ^quency.^ For 
a child with high frequency, we may find our two observerii riving us scores 
of 75 and 125 respectively. This would be equivalent to 60% agreement and^ 
of course, represents a raw discrepancy of 50 occurrences. if these 

examples were repeated throughout the distribution of scores -ina if there 
were little overlap, a high correlation would be obtained. This would be 
even more true, of course, if one observer consistently over* ".imated the 
rates observed by the other. Yet, even this possibility doe^ not necessarily 
jeopardize the utility of the method. It must merely be, recor.nized, examined 
and its implication for the question under study evaluated. In our own 
research we want to catalogue the deviancy rates of normal chi'iaren, coinpare 
them with deviant children, and observe changes in deviancy rat.es as a 
result of behavior modification training with parents. For th^se purposes, 
general agreement on levels of deviant responding is quite good enough. 
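
The point about range can be checked numerically with a small Python sketch. The score pairs (5, 10) and (75, 125) are the ones given in the text; the three intermediate pairs are hypothetical values added so that a correlation can be computed, and `pearson_r` is an illustrative helper rather than any procedure reported by the authors.

```python
import math


def pearson_r(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# One score per child; first and fourth pairs are from the text,
# the remaining pairs are invented for illustration.
observer_a = [5, 20, 40, 75, 100]
observer_b = [10, 35, 65, 125, 160]

# Smaller/larger ratio for each child
ratios = [min(a, b) / max(a, b) for a, b in zip(observer_a, observer_b)]
r = pearson_r(observer_a, observer_b)

print([round(v, 2) for v in ratios], round(r, 3))
```

Every child-level ratio here falls between .50 and about .63, yet the correlation across children is very high, because one observer's scores run consistently above the other's across a wide range.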

In our research on the normal child, we have had 47 families of the
total 77 families observed for the regular five-day period by an assigned
observer. On one of these days an additional observer was sent to the family
for the purpose of checking observer agreement. The correlation between the
deviant behavior scores of the two observers was .80. But, in a purely
statistical sense, this figure is an underestimate of what the agreement
correlation would be for the full five days of observation. Since we are
using a statistic based on five times as much data, we want to know the
expected observer agreement correlation for this extended period. Adding time
to an observation period is analogous to adding items to a test. The problem
we are faced with here is very similar to that dealt with by traditional test
theorists who have sought, for example, to estimate the reliability of an
entire test based on the reliability of some portion of the test. In our
case, we want to know the expected correlation for the statistic based on
five days when we have the correlation based on one day. The well-known
Spearman-Brown formula (Guilford, 1954) may be applied to this end (as in
Patterson, Cobb, & Ray, 1972; Patterson & Reid, 1970; Reid, 1967).

\[
r_{nn} = \frac{n \, r_{11}}{1 + (n - 1) \, r_{11}}
\]

where $r_{11}$ = reliability of the test of unit length
      $n$ = length of total test.
With the Spearman-Brown correction, the expected observer agreement
correlation for the deviant behavior score is .95. This same procedure has
also been applied to other statistics of particular interest in this research
including: a) the proportion of the parent's generally "negative" responses
(corrected agreement = .97), b) the proportion of the parent's generally
positive responses (corrected agreement = .98), c) the median agreement
coefficient of the 29 behavior codes observed for five or more children
(corrected agreement = .91), d) the median corrected agreement of the 11
out of 15 deviant behavior codes used (r = .91), e) the number of parental
commands given (corrected agreement = .99), and f) the compliance ratio
(i.e., compliances/compliances plus noncompliances) of the child (corrected
agreement = .92). As our research is completed, we will be presenting
observer agreement data using different statistics, computed in different
ways, and evaluated by different criteria.
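
As a check on the arithmetic, the Spearman-Brown projection for the deviant behavior score can be sketched in Python. The function name `spearman_brown` is an assumption for illustration; the inputs (.80 for one day, lengthened fivefold) are the values given in the text.

```python
def spearman_brown(r_unit, n):
    """Projected reliability when a measure of unit length with
    reliability r_unit is lengthened n-fold."""
    return n * r_unit / (1 + (n - 1) * r_unit)


# One-day observer agreement correlation of .80, projected to five days
projected = spearman_brown(0.80, 5)
print(round(projected, 2))  # 0.95
```

The same call, with the appropriate one-day coefficient, would apply to each of the other statistics listed above.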
The primary point of this section is to indicate that there are many
ways of calculating observer agreement data and there is no one "right way
to do it." The methods differ on three basic dimensions: a) the nature
and breadth of the dependent variable unit, b) the time span covered, and
c) the method of computing the index. Each investigator must make his own
decisions on each of these three points in line with the purposes of his
investigation. But, the investigator should be guided by one central
prescription--the agreement data should be computed on the score used as the
dependent variable. It makes no sense to report overall average agreement
data (except perhaps as a bow to tradition) when the dependent variable
is "deviant behavior rate." In addition, it makes little sense to make
the agreement criteria relative to time span more stringent than necessary.
If the dependent variable is overall rate of deviant behavior for a
five-day period, then this is the statistic for which agreement should be
computed. It is not necessary for this limited purpose that both observers
see the same deviant behavior in the same brief time block.

Before closing this section on the computation of observer agreement,
we should address the somewhat unanswerable question of the minimum criteria
for the acceptability of observer agreement data. In other words, how much
agreement is sufficient for moving on to consider the results of a particular
study? When using observer agreement percent, it would seem reasonable, at
the very minimum, to show that the agreement percent is greater than that
which could be expected by chance alone. When dealing with correlation data,
one should at least show the obtained correlation to be statistically
significant. These criteria are, of course, extremely minimal and certainly
far below those criteria commonly used in traditional testing and measurement
to establish reliability (e.g., see Guilford, 1954). Yet, these criteria
do provide a reasonable lowest level standard and there are some very good
reasons why we should not be overly conservative on this point. In the
first place, very complex codes, which may provide us with some of our
most interesting findings, are very difficult to use with complete accuracy.
On the basis of our experience, and that of G. R. Patterson (personal
communication), we see an overall agreement percent of 80% to 85% as
traditionally computed as a realistic upper limit for the kind of complex
code we are using.

Furthermore, to the extent that less than perfect agreement represents
only unsystematic error in the dependent variable, it cannot be considered
a confounding variable accounting for positive results. Any positive finding
which emerges in spite of a good deal of "noise" or error variance is probably
a relatively strong effect.

Low observer agreement does, however, have very important implications 
for negative results. This gets us back to the fundamental principle that 
one can never prove the null hypothesis. The more error in the measurement 
instrument, the greater the chance for failing to discover important
phenomena. Thus, just as with traditional test reliability, the lower the
observer accuracy, the less confidence one can have in any negative findings
from the research.

Observer Agreement and Accuracy II: 
Generalizability of Observer Agreement Data 
All of the preceding discussion on the calculation of observer agreement
data relies on the assumption that the obtained estimates of agreement
are generalizable to the remainder of the observers' data collection.
In most naturalistic behavioral research, however, this assumption cannot
go unchallenged and this brings us to our next, and largely soluble,
methodological problem. To illustrate this problem, let us take the not
untypical case of an investigator who trains his observers on a behavioral
code until they meet the criterion of two consecutive observation sessions
at 80% agreement or better. After completing this training, the
investigator embarks on his research with no further assessment of observer
agreement. There are three basic problems with this methodology which make
the generalizability of this agreement data extremely questionable. These
problems are a) the nonrandomness of the selected data points, b) the
unrepresentativeness of the selected data points in terms of the time of the
assessment, and c) the potential for the observer's reactivity to being
checked or watched. The first two problems may be rather easily solved in
all naturalistic research, but the third problem represents quite a challenge
to some forms of naturalistic observation. Let us explore these problems in
more detail. The nonrandomness of selecting the last two "successful" 
observation sessions in a series for establishing a true estimate of 
agreement should be very obvious. It is not unlikely that, had the investi- 
gator obtained several additional agreement sessions, he would find the 
average agreement figure to be lower than 80%. It is quite possible that 
our observers had, by chance, two consecutive "good days" which are highly 
unrepresentative of the days to come. One can almost visualize our hypo- 
thetical investigator, after the first day of highly accurate observation, 
saying to his observers, "That was really a good one; all we need is one 
more good session and we can begin the study." But, now we are getting 
into problems two and three. 
The second problem of unrepresentativeness in terms of time has
previously been discussed by Campbell and Stanley (1966) and labeled
instrument decay. That is, estimates of observer accuracy obtained one week
may not be representative of observer accuracy the next week. The longer the
research lasts, the greater is the potential problem of instrument decay. In
the case of human observers, the decay may result from processes of
forgetting, new learning, fatigue, etc. Thus, because of instrument decay,
our investigator's estimate of 80% agreement is probably an exaggeration of
the true agreement during the study itself. The problem of instrument decay
is also often compounded by the fact that during observer training, there is
usually a great deal of intense and concentrated work with the code, coupled
with extensive training and feedback concerning observer accuracy. This
intensity of experience and feedback is usually not maintained throughout the
course of the research, and, as a result, the two time periods are
characterized by very different sets of experiences for the observers. The
third problem of generalizability of this agreement data involves the simple
fact that people often do a better, or at least a different, job when they
are aware of being watched as opposed to when they are not. Campbell and
Stanley (1966) have labeled this problem reactive effects of testing. It
is likely that, when observers are being "tested" for accuracy, they will
have heightened motivation for accuracy and heightened vigilance for
critical behaviors or for the coding peculiarities of their calibrator. This
point has been brought home dramatically to us on more than one occasion
by the tears of an observer after earning a particularly low agreement
rating. Thus, because of the reactivity problem, estimates of observer
agreement obtained with the awareness of the observer are likely to
overestimate the true agreement level which would be obtained if the observer
were not aware of such calibration.
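
The first of the three problems, the nonrandomness of stopping at two consecutive "good" sessions, can be illustrated with a small simulation in Python. Everything numerical here is an assumption made for the sketch: session agreement is drawn from a normal distribution with a true mean of .72 and a standard deviation of .08, and the 80% criterion is taken from the hypothetical example in the text. It models only selection on chance fluctuation, not instrument decay or reactivity.

```python
import random


def train_to_criterion(rng, true_mean=0.72, sd=0.08,
                       criterion=0.80, max_sessions=500):
    """Simulate sessions until two consecutive ones meet the criterion.
    Returns the two criterion-session scores, or None if never reached."""
    previous = None
    for _ in range(max_sessions):
        # Hypothetical session score, clipped to the 0-1 range
        score = min(1.0, max(0.0, rng.gauss(true_mean, sd)))
        if previous is not None and previous >= criterion and score >= criterion:
            return previous, score
        previous = score
    return None


rng = random.Random(0)  # fixed seed so the sketch is repeatable
results = [train_to_criterion(rng) for _ in range(200)]
reached = [pair for pair in results if pair is not None]

# Average agreement on the criterion sessions across simulated observers
criterion_mean = sum(sum(pair) / 2 for pair in reached) / len(reached)
print(f"true mean = 0.72, mean on criterion sessions = {criterion_mean:.2f}")
```

By construction the criterion sessions average at least .80, although the simulated observer's long-run agreement is .72, so the criterion figure overstates what later sessions would show even with no decay and no reactivity.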

Fortunately, all of the preceding logical arguments have been
investigated in some recent research largely contributed by John Reid of the
Oregon Research Institute. In his first published study on this problem, Reid
(1970) designed a study which, from the observer's point of view, was almost
identical to the hypothetical example given earlier. In this study,
observers were trained for a maximum of eight days on complex behavioral
coding of videotape material in a laboratory setting. They were trained until
they obtained 70% agreement on two consecutive days for 15 minutes of
observation per day. In each training session, feedback was given to the
observers concerning their accuracy, but in the immediately following
assessment session no such feedback was given. After training was completed,
the observers were led to believe that they would be the only ones seeing the
new tapes because criterion protocols had not been previously constructed and
the tapes would have to be erased immediately after coding so that they
might be reused. Observers were urged, therefore, to be very careful in
their subsequent coding. On each of four consecutive days after reaching
criterion, the observers were asked to complete 30 minutes of observation
on similar stimulus materials. The first 5 minutes of each session served
as a practice period in which the experimenter gave feedback to the
observer when necessary. Although the observers were not aware of it,
criterion protocols had been prepared on these videotape materials, thus
providing a covert assessment of observer accuracy. Throughout the study,
observer accuracy was computed by comparing the data entry with the
standard protocol. Both errors of commission and omission were counted. This
is the type of overall and generally stringent accuracy criteria which we
have earlier recommended against for most research purposes. In any event,
all observers reached the 70% agreement criterion at or before six days
of training and the average agreement during the two criterion days was
75.50%. The average agreement during the following four days, however, was
only 51.25%, representing a drop of nearly 25 percentage points. Now it is
obvious that, in this first study, all three confounding problems were
present--a) unrepresentativeness of the criterion days, b) instrument decay
due to passage of time, etc., and c) reactivity to being tested. This is
the procedure, however, which is representative of many investigations in
the field. Reid's study certainly demonstrates the potentially powerful
impact of these confounding influences. Because of the relatively short
time span involved in this study, it is likely that most of this effect
is attributable to the reactivity problem.

A more recent study by Taplin and Reid (in preparation) sheds greater
light on the contribution of each of these confounding influences.
Although some of the details of this latter study differed from the earlier
investigation, the basic procedures and concepts involved were the same.
After training to criterion of 70% agreement, the observers were assigned
to one of three groups: a) No-check group--the procedures in this group
essentially replicated those of the former study such that the observers
were unaware of agreement assessment beyond the criterion period. b)
Spot-check group--in this group, observers were told that their accuracy
would be checked from time to time during the experimental phase, but that
they would be aware of when these checks would be made. This group controls
for the problem of unrepresentativeness of criterion day performance as
well as instrument decay, but does not control for reactivity to testing.
c) Random-check group--in this group, the observers were told that their
accuracy would be checked randomly in 20% of the sessions but that they
would not be aware of when these checks occurred. This procedure not only
controls for the first two problems, but also controls for the reactivity
problem by making it constant. In other words, it is likely that the
knowledge of a "potential" assessment of accuracy will increase the
observer's motivation and vigilance. Yet, this is not a confounding influence
since any random assessment of accuracy in the series will not be
unrepresentative of the other points in the series. Thus, whenever
practically possible, this strategy of random assessment without observer
awareness or assessment on every observation occasion is most consistent with
rigorous research requirements. The results of the Taplin and Reid (in
preparation) study are presented in Figure 1 and are of particular interest
here for the way in which they illustrate the operation of the three
confounding problems. The discrepancy between percent agreement on the
criterion days and on the days of data collection is obvious and
significant for all groups.
Thus, the use of this kind of previously collected criterion data for
generalization to subsequent data collection appears highly inappropriate,
regardless of the method of data collection or the rationale provided
observers during the data collection period. In all groups, the confounds
of unrepresentativeness of criterion day data and instrument decay from
the criterion days are obvious. Now, the observer accuracy data produced
on the spot check days are not unrepresentative, in the sense that these
days were presumably not chosen initially as particularly "good" ones, as
were the criterion days. In addition, this procedure controls for the
problems of instrument decay due to passage of time since the checks occur
randomly throughout the period of data collection. It is clear that
"spot check" accuracy assessment produced less overestimation of true
accuracy than did the criterion day estimates. The data in this "spot check"
group is very clear, however, in illustrating the problem of observer
reactivity. The agreement percents in the two spot check periods are clearly
overestimates of agreement percents obtained under covert assessment. Thus,
the "spot check" method represents a decided improvement over the criterion
score method, but it is an imperfect solution.

In some naturalistic settings, the "spot check" method may have to
remain the optimal solution for reasons of cost and the potential for the
reactive effects inherent in using multiple observers continuously (e.g.,
in observations in homes or other confined quarters). As previously stated,
the methods of total accuracy assessment (e.g., as used by Wahler, 1967, and
Browning & Stover, 1971) or random accuracy assessment without awareness
(as in Taplin & Reid, in preparation) are always preferable when possible.
These methods are, of course, particularly simple to apply with video or
audio tape materials or in natural settings where two or more observers
are, for whatever reason, employed simultaneously and continuously. In
classrooms, for example, it is often the case that two or more observers
record the behaviors of two or more children. Under these circumstances,
the investigator can arrange the observers' recording schedules so that their
observations of subjects overlap at random times. In this way, two observers
can record the behavior of the same subject at the same time without either
having knowledge of the ongoing calibration for agreement which is occurring
at that specific time. This procedure would replicate the "random check
group" of Taplin and Reid (in preparation) in a field setting. It would
probably be difficult, if not impossible, to keep the fact of random
calibration a secret from the observers for any extended period, but, as
stated earlier, this is no real problem, because the randomly collected data
without specific awareness is representative of accuracy at other times. The
Taplin and Reid (in preparation) data would suggest that the motivational
effects of informing observers of the random checks slightly increase the
level and stability of their accuracy scores. (Compare the three groups'
accuracy level and stability in the data collection period in Figure 1.)
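
One way an investigator might draw the sessions to be calibrated is sketched below in Python. The function name `pick_check_sessions`, the 40-session study length, and the seed are hypothetical; only the 20% proportion comes from the random-check condition described above.

```python
import random


def pick_check_sessions(n_sessions, proportion=0.20, seed=None):
    """Randomly choose which observation sessions receive an
    unannounced agreement check. Sessions are numbered from 1."""
    rng = random.Random(seed)
    n_checks = max(1, round(n_sessions * proportion))
    return sorted(rng.sample(range(1, n_sessions + 1), n_checks))


# Hypothetical study of 40 sessions per observer, 20% checked
checks = pick_check_sessions(40, seed=1)
print(checks)
```

The list would be held by the investigator and not shared with observers, who are told only that some sessions will be checked.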

In more recent research, Reid and his colleagues have directed their
efforts to finding ways of eliminating the instrument decay or "observer
drift" observed in all previous studies regardless of the method of
monitoring. In several long-term research projects, including our own (e.g.,
Johnson, Wahl, Martin, & Johansson, 1972), the one directed by G. R.
Patterson (e.g., Patterson, Cobb, & Ray, 1972) and the one reported by
Browning and Stover (1971), continuous training, discussion of the coding
system, and accuracy feedback are provided for the observers. It is possible
that this kind of training and feedback could eliminate, or at least
attenuate, observers' accuracy drift as well as the problem of the
unrepresentativeness of "spot check" accuracy assessments. To test this
hypothesis, DeMaster and Reid (in preparation) designed a study in which
three levels of feedback and training during data collection were compared in
a sample of 28 observers. The observers were divided into 14 pairs and all
subsequent procedures were carried out in the context of these fixed pairs.
The three experimental groups were as follows: Group I--Total
Feedback--In this group observers a) discussed their observation performance
together while reviewing their coding of the previous day's video tape, b)
discussed their previous day's observation with the experimenter in terms of
their agreement with the criterion coded protocol, and c) received a daily
report of their accuracy with respect to the criterion protocol. Group
II--Intra-pair Agreement Feedback--In this group, observers a) were given the
opportunity to discuss their performance as in a) above and b) were given a
daily report on the extent to which each observer's coding protocol agreed
with the protocol of the other observer. Subjects in this group were deprived
of a discussion or report of their level of agreement with the criterion
protocols. Group III--No Feedback--Subjects in this group were deprived of
the kinds of feedback given in the previous two conditions and were
instructed not to discuss their work among themselves to eliminate possible
"bias of the data."
This group was similar in concept to the random-check group in the Taplin
and Reid (in preparation) study in that they were told, as were all other
subjects, that their accuracy would be checked at random intervals in the
data collection period. The dependent variables were a) the agreement scores
between pairs of observers and b) the "accuracy" scores reflected by the
percent agreement with the criterion protocols. The results showed that the
intra-pair observer agreement scores were significantly higher than were
scores reflecting agreement with the criterion. These results tend to
corroborate the hypothesis forwarded by Baer, Wolf, and Risley and
Bijou, Peterson and Ault (1968) that high intra-pair agreement does not
necessarily reflect proper use of the coding system. We shall call this
problem "consensual observer drift." It is very important to note, however,
that the design of this study which placed observers in fixed and unchanging
pairs would tend to maximize this effect. In the field studies referred to
above, observers typically meet in larger groups for training and feedback
and observers rotate in calibrating each other's observations. Under these
circumstances, the effects of consensual drift would logically be expected
to be less potent. Indeed, further data from the DeMaster and Reid (in
preparation) study lends support to this argument. On those video-tape
materials where more than one pair of observers had coded the sequence,
the investigators compared the fixed pair agreement with the agreement
between observers in other pairs. In all cases, the fixed pair agreed more
with one another than they did with the observers in the other pairs. Thus,
this idiosyncratic drift of fixed pairs may be greater than drift
experienced under currently employed field research procedures. Yet, a recent
study by Romanczyk, Kent, Diament, and O'Leary (1971) showed that during
overt agreement assessment observers would change their coding behavior
to more closely approximate the differential coding styles of their
calibrators. Thus, it is possible for observers to produce one kind of
consensual drift with some calibrators and an opposite consensual drift with
others to yield artificially high observer agreement data.
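
The distinction between the two dependent variables, intra-pair agreement and accuracy against a criterion protocol, can be sketched in Python. The single-letter codes and the three ten-interval protocols are invented for illustration and do not correspond to any actual coding system or data.

```python
def percent_agreement(codes_1, codes_2):
    """Percent of intervals on which two protocols carry the same code."""
    if len(codes_1) != len(codes_2) or not codes_1:
        raise ValueError("protocols must be non-empty and of equal length")
    agree = sum(1 for a, b in zip(codes_1, codes_2) if a == b)
    return 100.0 * agree / len(codes_1)


# Invented protocols: one code per interval, arbitrary code labels
criterion  = ["A", "A", "B", "C", "A", "D", "E", "A", "B", "A"]
observer_1 = ["A", "A", "C", "C", "A", "D", "D", "A", "C", "A"]
observer_2 = ["A", "A", "C", "C", "A", "D", "D", "A", "B", "A"]

intra_pair = percent_agreement(observer_1, observer_2)
accuracy_1 = percent_agreement(observer_1, criterion)
accuracy_2 = percent_agreement(observer_2, criterion)
print(intra_pair, accuracy_1, accuracy_2)
```

In this invented case the pair agree with each other more than either agrees with the criterion, because they share the same departures from it; that is the pattern labeled consensual observer drift.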

The manipulations in the Romanczyk et al. (1971) study were quite
powerful, however, and one can question the generalizability of these
artificially induced conditions to real field studies. Nevertheless, this
study does demonstrate the potential for powerful and differential consensual
drift. In spite of these considerations, one must realize that it is
impossible in an ongoing field observation to have a "pure" criterion
protocol, since one cannot arbitrarily designate one observer's protocol as
the "true" criterion and the other as the imperfect approximate. But, one can
attenuate this problem considerably by having frequent training sessions with
observers on pre-coded video-tape material or on pre-coded behavioral
scripts which may be acted out live by paid subjects. The importance of
this recommendation is underlined by DeMaster and Reid's (in preparation)
second important finding. Analysis of the data indicated a significant main
effect for feedback conditions, with the total feedback group doing best,
followed by the intra-pair feedback group and the no feedback group,
respectively.

It may be of interest to review briefly how our own project stacks up
with regard to these considerations and to suggest ways in which it and
similar projects might be improved in this area. Initial observer training
in our laboratory consists of the following program: a) reading and study
of the observation manual, b) completion of programmed instruction materials
involving precoded interactions, c) participation in daily intensive training
sessions which include discussion of the system and coding of precoded
scripts which are acted out live by paid but nonprofessional actors, d)
field training with a more experienced observer followed immediately by
agreement checks. Currently, when an observer obtains five sessions with
an average overall percent agreement of 70% or better, she may begin regular
observation without constant monitoring. All observers continue to
participate in continuous training and are subject to continuous checking
with feedback. This is accomplished in two ways. First, each observer is
subject to one spot-check calibration for each family she observes. This
calibration may come on any one of the regular five days of observation.
Both observers figure their percent agreement in the traditional way
immediately after the session and discuss their disagreements at this time.
If they cannot resolve their disagreement on a particular or idiosyncratic
problem, they call the observer trainer immediately who serves as sort of an
imperfect criterion coder. From time to time, idiosyncratic problems arise
which cannot be resolved by the coding manual alone. Decisions on how to
code these special cases are made by the group and the trainer and are
entered in a "decision log" which is periodically studied by all observers.
These special circumstances are unfortunate and provide an opportunity for
consensual drift, but are part of the reality with which we must deal. The
"decision log" helps attenuate the drift problem on these decisions, and most
of them tend to be idiosyncratic to one or two families. The second aspect 
of continual training involves a minimum of one 90-minute training session 
per week for all observers involving discussion and live coding experience. 
We have been negligent in our procedures in not retaining our precoded 
scripts over time and recoding these from month to month and year to year. 
On the basis of our review of Reid's excellent work, we have now begun to 
correct this error by retaining these scripts and subjecting them to recoding 
periodically to check the problem of "consensual observer drift." As will 
be obvious, we use the imperfect method of "spot check" calibration for 
observer agreement, but Reid's data is encouraging in that it indicates that 
the kind of intensive and continual training outlined here may attenuate the 
problems associated with this method. Furthermore, our observers are con- 
vinced that calibration scores obtained on a single day of observation are 
probably lower than would be obtained over two or more days of observation. 
The reason for this belief is that the calibrator would logically have more 
difficulty in adapting to each new home environment and identifying the 
subjects of observation on the first day in the home than on subsequent 
days. Unfortunately, we have no hard data to prove this hypothesis, but we
have begun to do more than one day of calibration on families in order to 
test it. 
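
The agreement criterion described above can be illustrated with a short
sketch. It assumes that percent agreement "in the traditional way" means
agreements divided by agreements plus disagreements, computed interval by
interval, and that the criterion is applied to the five most recent sessions.
The function names and the two-letter codes in the example are hypothetical
and are not taken from the project's observation manual.

```python
from statistics import mean


def percent_agreement(observer_codes, calibrator_codes):
    """Percent of intervals on which two coders recorded the same code.

    Assumed formula: agreements / (agreements + disagreements) * 100.
    """
    if len(observer_codes) != len(calibrator_codes):
        raise ValueError("both protocols must cover the same intervals")
    if not observer_codes:
        raise ValueError("protocols must not be empty")
    agreements = sum(a == b for a, b in zip(observer_codes, calibrator_codes))
    return 100.0 * agreements / len(observer_codes)


def meets_training_criterion(session_agreements, required_sessions=5, threshold=70.0):
    """True if the most recent sessions average at or above the threshold.

    Using the *last* five sessions is an assumption; the text only says
    "five sessions with an average overall percent agreement of 70%".
    """
    if len(session_agreements) < required_sessions:
        return False
    return mean(session_agreements[-required_sessions:]) >= threshold


# Hypothetical codes for four observation intervals.
observer = ["CM", "CO", "NC", "TA"]
calibrator = ["CM", "CO", "NC", "PL"]
session_score = percent_agreement(observer, calibrator)
ready = meets_training_criterion([60.0, 72.0, 75.0, 68.0, 80.0])
```

In this example the two coders agree on three of four intervals, and the five
hypothetical session scores average 71, so the trainee would pass the
criterion as sketched here.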

The problem of consensual drift is also attenuated in this project by
the practice of having each observer calibrate all other observers. We
recently began to employ only one calibrator for reasons of convenience and
cost, but this review has persuaded us to return, at least partially, to
multiple calibration among all observers. 

As stated earlier, the problems associated with reactivity to testing 
for observer agreement could largely be solved by procedures which involved 
coding of audio or video tapes. This is true because one could arrange 
calibration on a random basis without observer awareness. Because proce- 
dures of this kind could also solve or attenuate problems of observer bias 
and subject reactivity, we are beginning to consider procedures of this type 
more seriously for future research and are now involved in pilot work on 
the feasibility of these methods. Short of this, we must be content with the
"spot check" method as outlined and attempt to attenuate the problems associated
with this method by use of extensive training and feedback as
suggested by DeMaster and Reid (in preparation). 

Reliability of Naturalistic Behavioral Data 

One must look long and hard through the behavior modification literature 
to find even an example of reliability data on naturalistic behavior 
rate scores. In classical test theory, the concept of reliability involves 
the consistency with which a test measures a given attribute or yields a 
consistent score on a given dimension. Theoretically, a test of intelli- 
gence, for example, is reliable if it consistently yields highly similar 
scores for the same individual relative to other individuals in the sample. 
There are several approaches to measuring reliability including split-half 
measures, equivalent forms, test-retest methods, etc. Each method has a
somewhat different meaning, but the basic objective of each is an estimate
of the consistency of measurement. It is difficult to tell whether behaviorists
have simply neglected, or deliberately rejected, the reliability requirement
for their own research. The concept comes out of classical test
theory and is obviously allied to trait concepts of personality. Behaviorists
may feel that the concept is irrelevant to their purposes. After all,
we know that there is often very little proven consistency in human behavior
over time and stimulus situations (e.g., see Mischel, 1968), so why should
we require a consistency in our measurement instruments that is not present
in real life? Behaviorists may feel that reliability is an outmoded concept
and belongs exclusively to the era of trait psychology. If this is,
in fact, the reason for the neglect of the reliability issue in behavioral 
research, it represents a serious conceptual error and a clear misapplication 
of the meaning of the data on the lack of behavioral consistency so eloquently
summarized by Mischel (1968). It is true, of course, that behaviorists
employ more restricted definitions of the topography of the relevant
response dimensions (e.g., hitting vs. aggression) and that they often include
more restrictive stimulus events in defining these dimensions (e.g.,
clude more restrictive stimulus events in defining these dimensions (e.g., 
child noncompliance to mother's commands vs. child negativism). Yet, the 
fact remains that we are still dealing with scores that reflect behavioral 
dimensions. If the word "trait" offends, then another label will do as 
well. Furthermore, the scores are obtained for the same purposes that trait
scores are obtained: to correlate with some other variable. Generally,
behavior modifiers "correlate" these scores with the presence or absence 
of some treatment procedure but certainly our data is not limited to this one 
objective. In our own research, for example, we are currently comparing 
children's deviant behavior rates in their homes with their deviancy in the 
school classroom (Walker, Johnson, & Hops, 1972) and comparing the deviancy
rates of normal children with those observed in referred or "deviant"
children (Lobitz & Johnson, 1972). The most elementary knowledge of the
concept of reliability tells us that some minimal level of behavior score
reliability is necessary before we can ever hope to obtain any significant
relationship between our behavioral score and any external variable. Thus,
the requirement of score reliability is just as important in research
employing behavioral assessment as it is in more traditional forms of psychological
assessment, but with only a few exceptions (e.g., Cobb, 1969;
Harris, 1969; Olson, 1930-31; Patterson, Cobb, & Ray, 197?) behaviorists
have ignored this important issue. 

As a consequence of the reasoning presented above, we have been particularly
cognizant of the reliability of the scores used in our research.
We were quite encouraged to find, for example, that the odd-even split-half
reliability of our "total deviant behavior score" in a sample of 33 "normal"
children was .72. This reliability was computed by correlating the total
deviant behavior score obtained on the first, third, and first half of the
fifth day with the same score obtained from the remainder of the period.
After applying the Spearman-Brown correction formula, we found that the
reliability of this score for the entire five-day observation period was .83.
This relatively high level of reliability indicates that this score should,
at least in a statistical sense, be quite sensitive to manipulation or to
true relationships with other external variables (e.g., social class, or
educational level of the parents). Other behavioral scores which are im- 
portant to our research include: a) the proportion of generally negative 
responses of the parents (corrected reliability = .90), b) the proportion
of generally positive responses of parents (corrected reliability = .87),
c) the median reliability of the 35 individual codes (r = .69), d) the
corrected median reliability of the deviant codes = .66, e) the number of
parental commands during the observation (corrected reliability = .85), and
f) the compliance ratio (i.e., compliances/(compliances + noncompliances)) of
the child (corrected reliability = .49). The reliability of the compliance
ratio is not as high as we might have wished, but it may still be high enough
to be sensitive enough for powerful manipulations. We have been less for- 
tunate in obtaining good reliability scores on some other statistics import- 
ant to our research efforts. For example, the compliance ratios to specific 
agents (i.e., to mothers or fathers) have yielded rather low reliabilities. 
The reasons for this are two-fold: First, ratio scores are always less reliable
than are their component raw scores, because they combine the error
variance of both components. Second, and of more general importance, 
these scores are based on relatively few occurrences. On the average, for 
example, fathers give only 36 commands over the five-day period. These
occurrences must then be divided for the compliances and noncompliances and
further split in half for the odd-even reliability estimate. By the time
this erosion takes place, there are few data points on which to base reliability
estimates. This problem is even more profound when we use one day
of compliance ratio data to compute observer agreement on this statistic, 
since, on the average, fathers give only 7.2 commands per day. Thus, when 
we are dealing with behavioral events of fairly low base rate, observer 
agreement correlations and reliability coefficients may often not be 
"fairly" computed because there is simply not enough data. In classical 
test theory terminology, there may often not be enough "items" on the behavioral
test to permit an accurate estimation of the reliability of the
score. What should we do with cases of this kind? A methodological purist 
might argue that we should throw out this data and use only scores with 
proven high reliability and observer agreement. We would argue that this
course would be a particularly unfortunate solution for several reasons. 
First, low base rate behaviors are often those of special importance in 
clinical work. Second, if low reliability reflects nothing more than 
random, unsystematic error in the measurement instrument, it cannot jeopar- 
dize or provide a confounding influence on positive results (i.e., it cannot 
contribute to the commission of Type I errors). But, either low reliability 
or low observer agreement does have profound implications for the meaning
of negative results (i.e., the commission of Type II errors). Fortunately, 
the effects of many behavior modification procedures are so dramatic that 
they will emerge significant in spite of relatively low reliability or
observer accuracy. 
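
The split-half correction and the compliance ratio discussed above can be
sketched in a few lines. The Spearman-Brown formula used here is the standard
one for a test lengthened by a given factor; the function names are our own,
and the counts passed to the ratio function are hypothetical. Only the values
.72, 36 commands, and five days come from the text.

```python
def spearman_brown(r_half, length_factor=2.0):
    """Spearman-Brown estimate of reliability for a lengthened test.

    With length_factor=2 this steps a split-half correlation up to the
    reliability of the full observation period.
    """
    return length_factor * r_half / (1.0 + (length_factor - 1.0) * r_half)


def compliance_ratio(compliances, noncompliances):
    """compliances / (compliances + noncompliances); None if no commands."""
    total = compliances + noncompliances
    if total == 0:
        return None
    return compliances / total


# Split-half reliability of .72 stepped up to the full five-day period.
full_period_reliability = spearman_brown(0.72)

# Erosion of data points for father-specific compliance ratios.
father_commands_total = 36
commands_per_day = father_commands_total / 5    # average commands per day
commands_per_half = father_commands_total / 2   # commands in each odd-even half
```

Under these assumptions the stepped-up value comes out near .84, close to
the .83 reported; the small difference may reflect rounding of the half-test
correlation. The last two lines restate the arithmetic in the text: 36
commands over five days is 7.2 per day, and only 18 per half once the data
are split for the odd-even estimate.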

In one of the other few examples of reliability data in the behavior 
modification literature, Cobb (1969) found that the average odd-even reliability
of relevant behavioral codes used in the school setting was only
.72. Yet, Cobb (1969) found that the rates of certain coded behaviors
showed strong relationships to achievement in arithmetic. Thus, relatively
low reliability or observer agreement jeopardizes very little the meaning 
of positive results, but leaves negative results with little meaning. 
There is, however, one very critical qualifying point to this argument. It 
is that the error expressed in low reliability or observer accuracy must 
be random, unsystematic, and unbiased. With this consideration in mind, we
now move to what are perhaps the most important methodological issues in
naturalistic research: observer bias and observee reactivity to the observation
process.
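
The argument that random error weakens, but does not manufacture,
relationships can be made concrete with the attenuation formula from
classical test theory, in which the observed correlation equals the true
correlation multiplied by the square root of the product of the two
reliabilities. The sketch below is only an illustration: the true correlation
of .50 is a hypothetical value, while .83 and .49 are reliabilities reported
earlier in this section.

```python
import math


def attenuated_correlation(true_r, reliability_x, reliability_y=1.0):
    """Expected observed correlation when both measures contain random error.

    Classical attenuation formula:
    observed_r = true_r * sqrt(reliability_x * reliability_y)
    """
    for value in (reliability_x, reliability_y):
        if not 0.0 <= value <= 1.0:
            raise ValueError("reliabilities must lie between 0 and 1")
    return true_r * math.sqrt(reliability_x * reliability_y)


hypothetical_true_r = 0.50  # assumed for illustration only

# Total deviant behavior score (corrected reliability .83).
observed_with_deviant_score = attenuated_correlation(hypothetical_true_r, 0.83)

# Child compliance ratio (corrected reliability .49).
observed_with_compliance_ratio = attenuated_correlation(hypothetical_true_r, 0.49)
```

With the less reliable score the same underlying relationship shrinks to
about .35, which makes a true effect harder to detect (a Type II risk) but
cannot by itself create an effect where none exists, provided the error is
random.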

The Problem of Observer Bias in Naturalistic Observation

Shortly after the turn of the century, O. Pfungst became intrigued
with a mysteriously clever horse named Hans. By tapping his foot, "Clever
Hans" was able to add, subtract, multiply and divide and to spell, read,
and solve problems of musical harmony (Pfungst, 1911). Hans' owner, a
Mr. von Osten, was a German mathematics teacher who, unlike the vaudeville
trainers of show animals, did not profit from the horse's peculiar talents.
He insisted that he did not cue the animal and, as proof, he permitted
others to question Hans without his being present. Pfungst remained incredulous
and began a program of systematic study to unravel the mystery
of Hans' talents.

Pfungst soon discovered that, if the horse could not see the questioner,
Hans could not even answer the simplest of questions. Neither would Hans 
respond if the questioner himself did not know the answer. Pfungst next 
observed that a forward inclination of the questioner's head was sufficient 
to start the horse tapping, and raising the head was sufficient to terminate 
the tapping. This was true even for very slight motions of the head, as
well as the lowering and raising of the eyebrows and the dilation and con- 
traction of the questioner's nostrils. 

Pfungst reasoned and demonstrated that Hans' questioners, even the 
skeptical ones, expected the horse to give correct responses. Unwittingly,
their expectations were reflected in their head movements and glances to 
and from the horse's hooves. When the correct number of hoof taps was 
reached, the questioners almost always looked up, thereby signaling Hans 
to stop (Rosenthal, 1966).

Some fifty years later, Robert Rosenthal began to investigate the 
importance of the expectations of experimenters in psychological research.
In his now classical article, Rosenthal (1963) presented evidence
suggesting that the experimenter's knowledge of the hypothesis could
serve as an unintended source of variance in experimental results. In a
prototypical study, Rosenthal and Fode (1963) had naive rats randomly
assigned to two groups of undergraduate experimenters in a maze-learning
task. One group of experimenters was told that they were working with maze- 
bright animals and the other group was told that their rats were maze-dull. 
The group of experimenters which was led to believe that their rats were 
maze-bright reported faster learning times for their subjects than the 
group which was told their animals were maze-dull. An extension of this 
finding to the classroom was offered by Rosenthal and Jacobson (1966).
Teachers were led to believe that certain, randomly selected students in 
their classrooms were "late bloomers" with unrealized academic potential. 
Pre- and post-testing in the fall and spring suggested that children in the
experimental group (late bloomers) had a greater increase in IQ than did
the controls.

The purpose of this section will be to examine the problem of experimenter-observer
bias with regard to naturalistic observational procedures.
The amount of literature which deals directly with observer bias in 
naturalistic observation is sparse (Kass & O'Leary, 1970; Skindrud, 1972; 
Kent, 1972). However, Rosenthal has written an extensive review of experi- 
menter bias in behavioral and social psychological research (Rosenthal, 1966). 
In spite of failures to replicate many of Rosenthal's findings (Barber & 
Silver, 1968; Clairborn, 1969) and extensive criticisms of Rosenthal's 
methodology (Snow, 1969; Thorndike, 1969; Barber & Silver, 1968), the massive
body of literature compiled and summarized by Rosenthal (1966) remains the
best available resource for conceptualizing the phenomenon of observer bias
and for isolating possible sources of bias relevant to naturalistic observation.
A brief review of this literature follows with a focus on integrating
implications from this literature with naturalistic observational
procedures. In addition, we will give consideration to the few experiments 
which have directly investigated observer bias in naturalistic observation 
and further consider some proposals for experiments yet to be conducted.
Finally, suggestions for minimizing observer bias will be outlined and data
on this problem from our laboratory will be presented.

Conceptualization of Observer Bias 

Rosenthal (1966) has defined experimenter bias "as the extent to which 
experimenter effect or error is asymmetrically distributed about the 
'correct' or 'true' value." Observer errors or effects are generally
assumed to be randomly distributed around a "true" or "criterion" value. 
Observer bias, on the other hand, tends to be unidirectional and thereby 
confounding.
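
The distinction between random error and bias can be illustrated by
comparing an observer's recorded rates with criterion values. In this sketch,
which is only one possible operationalization and not a procedure described
by Rosenthal, random error shows up as a mean signed error near zero, while
bias shows up as a mean signed error that departs from zero in one direction.
The function name and all data values are hypothetical.

```python
from statistics import mean


def error_summary(recorded, criterion):
    """Summarize an observer's errors relative to criterion values.

    mean_signed_error   -- near zero for symmetric (random) error,
                           consistently positive or negative for bias
    mean_absolute_error -- overall size of the errors, ignoring direction
    """
    if len(recorded) != len(criterion) or not recorded:
        raise ValueError("recorded and criterion must be equal-length and non-empty")
    errors = [r - c for r, c in zip(recorded, criterion)]
    return {
        "mean_signed_error": mean(errors),
        "mean_absolute_error": mean([abs(e) for e in errors]),
    }


# Hypothetical deviant-behavior counts for four sessions.
criterion_counts = [10, 12, 8, 15]
random_error_observer = [11, 11, 9, 14]   # errors of +1, -1, +1, -1
biased_observer = [11, 13, 9, 16]         # errors of +1 in every session

random_summary = error_summary(random_error_observer, criterion_counts)
biased_summary = error_summary(biased_observer, criterion_counts)
```

Both hypothetical observers are wrong by one count per session, so their
mean absolute errors are identical; only the second observer's errors are
asymmetrically distributed about the criterion, and only those errors would
confound a comparison between conditions.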

Sources of Observer Bias 

An important distinction should be drawn between observer error and 
observer effect on subjects. Invalid results may be contributed solely by 
systematic or "biased" errors in recording by observers. Or, invalid find- 
ings may be realized as a result of the effect that the observer has on his 
subjects (Rosenthal, 1966). First we will consider recording error as a 
source of observer bias. 

Kennedy and Uphoff (1939) illustrate the problem of recording errors in 
an experiment in extrasensory perception. The observers' task was simply to 
record the investigator's guesses as to the kind of symbol being "transmitted"
by the observer. Since the investigator's guesses for the observers
had been programmed, it was possible to count the number of recording errors.
In all, 126 recording errors out of 11,125 guesses were accumulated among
28 observers. The analysis of errors revealed that believers in telepathy
made 71.5 percent more errors increasing telepathy scores than did nonbelievers.
Disbelievers made 100 percent more errors decreasing the
telepathy scores than did their counterparts. Sheffield and Kaufman (1952)
found similar biases in recording errors among believers and nonbelievers
in psychokinesis on tallying the results of the fall of dice. Computational
errors in summing recorded rates have also been documented by Rosenthal in
an experiment on the perception of people (Rosenthal, Friedman, Johnson,
Fode, Schill, White, & Vikan-Kline, 1964).

It is doubtful that these recording and computational errors were in- 
tentional. However, as Rosenthal (1966, p. 31-32) notes, data fabrication 
or intentional cheating is not absent in psychological research, especially 
where undergraduate student experimenters are employed as data collectors. 
Rosenthal points out that these students "have usually not identified to a
great extent with the scientific values of their instructors." Students
may fear that a poor grade will be the result of an accurately observed
and recorded event which is incompatible with the expected event. Of two
experiments by Rosenthal which were designed to examine intentional erring
by students in a laboratory course in animal learning, one revealed a clear
instance of data fabrication (Rosenthal & Lawson, 1964) and the other showed
no evidence of intentional erring but did show some deviations from the pre- 
scribed procedure (Rosenthal & Fode, 1963). Another study employing student 
experimenters by Azrin, Holz, Ulrich, and Goldiamond (1961) replicated
Verplanck's (1955) verbal conditioning experiment. However, an informal
post-experimental check revealed that data had been fabricated by the student
experimenters. Later, the authors employed advanced graduate students as
experimenters and found that Verplanck's results were not replicated.

The implications for naturalistic observation are obvious. Observer 
error, whether it be unintentional or intentional, incurred during recording
or during computation, must be guarded against by accuracy checks and by
carefully concealing the experimenter's hypotheses. Although observer
agreement checks do not rule out the possibility of bias among the observers
whose data are compared, they at least arouse suspicion where agreement
figures are low and disagreements are consistent. Ideally, observers
should not be made responsible for the tallying of their own data. Computations
should be made by a nonobserver who is removed from knowledge of
the observations. Observers should be selected on the basis of their identification
with scientific integrity, and admonitions against possible
biasing effects should be repeated during the course of the experiment.
Finally, observers should be encouraged to disclose to the experimenter
both the nature and sources of any information they receive that might be
relevant to the objectivity of their observations. A questionnaire, filled
out after observation sessions, can facilitate this disclosure. 

The other source of observer bias, which Rosenthal discusses (Rosenthal, 
1966), is the effect of the observer's expectancy on the subject. If an 
observer has an hypothesis about a subject's behavior, he may be able to 
communicate his expectations and thereby influence the behavior.

Expectancy effects have previously been alluded to in Rosenthal's 
study with animal laboratory experimenters (Rosenthal & Fode, 1963) and
teachers in the classroom (Rosenthal & Jacobson, 1966). Rosenthal's first
major study in expectancy effects is instructive in its simplicity. Rosenthal
and Fode (1963) had 10 experimenters obtain ratings from 206 subjects
on the photo person-perception task. All 10 experimenters received identical
instructions except that five experimenters were informed that their
subjects would probably average a +5 success rating on the ten neutral
photos while the other five experimenters were led to expect a -5 failure
average. The results revealed that the group given the +5 expectation obtained
an average of +.40 vs. the -5 expectation group which yielded a
-.08 score. These differences were highly significant and subsequent replications
have supported these findings (Fode, 1960; Fode, 1965).

The implications for naturalistic observational procedures of the expectancy
effect on the subject's behavior are most discomforting. If, as
in the Rosenthal laboratory studies, observers in the natural setting can
communicate their expectancies to their subjects such that the subject's
behavior falls in line with those expectations, a serious threat to internal
validity is posed. Assuming that humans are no less sensitive to subtle
cues than Mr. von Osten's Clever Hans, it seems reasonable to infer that
observer expectancy effects are operative in the natural setting. Consider
the not atypical case of an observer who records selected deviant behaviors
of a child in a classroom before, during, and after treatment. Seldom is
it not obvious to the observer when treatment begins and ends. Assuming that
an observer might infer the expectations of the experimenter in such a
setting, how might he communicate these expectations to his subjects? One
way of influencing the targeted child is by nonverbal expressive cues.
Expressions of amusement by the observer during baseline might inflate
deviant behaviors. During intervention, expressions of disapproval or
caution by the observer might reduce the subject's deviant rate. These
biasing effects may be systematic and confounding.

Although few studies have systematically assessed the effects of ob- 
server bias in the natural setting, many field investigators have taken note 
of the expectancy phenomenon, and have included procedures to minimize 
its effect. One such technique is to mask changes in experimental conditions
(e.g., Thomas, Becker, & Armstrong, 1968). Another is to keep observers
unaware of assignment of subjects to various treatment or control conditions
(e.g., O'Conner, 1969). The addition of new observers in the last phase 
of a study who are naive to previous manipulations is another approach (e.g., 
Bolstad & Johnson, 1972). 

Three studies in the natural setting shed further light on expectancy 
effects with naturalistic observational procedures. Rapp (1966) had eight 
pairs of untrained observers describe a child in a nursery school for a 
period of one minute. One member of each observer pair was subtly informed 
that the child under observation was feeling "under par" that day and the 
other that the child was "above par." In fact, all eight children showed 
no such behaviors. Seven of the eight pairs of observers evidenced significant
discrepancies between partners in their description of the nursery
children in the direction of their respective expectations. Both recording 
errors and expectancy effects on the subjects' behavior may have contrib- 
uted to this demonstration of observer bias. 

A second study by Azrin et al. (1961) employed untrained undergraduate 
observers who were asked to count opinion statements of adults when they 
spoke to them. The observations of those who had been exposed to an 
operant interpretation of the verbal conditioning phenomenon under study 
were the exact opposite of those given a psychodynamic interpretation.
Again, both the expectancy effects of the observer on the subject and recording
errors may have accounted for the observer bias. Post-experimental
inquiries by an accomplice student revealed that recording errors were the
main factor. The accomplice learned that 12 of the 19 undergraduates 
questioned intentionally fabricated their data to meet their expectations. 

A third study by Scott, Burton and Yarrow (1967) allows a comparison 
between the simultaneous observations of hypothesis informed (Scott her- 
self) and uninformed observers. The observers coded behavior into positive 
and negative acts from an audio-tape recording of the targeted child and 
his peers. The informed observer's data differed significantly from the 
others' in the direction of the experimenters' hypothesis. 

These three studies strongly suggest that data collected by relatively 
untrained observers are influenced by observer expectations. Do these 
findings generalize to the observations of professional observers who are 
highly trained in the use of sophisticated multivariate behavior codes? 
As indicated earlier, the amount of available research which directly per- 
tains to this question is limited and somewhat equivocal. 

Kass and O'Leary (1970) conducted the first systematic attempt to
manipulate observer expectations in a simulated field-experimental situation.
Three groups of female undergraduates observed identical videotaped recordings
of two disruptive children in a simulated classroom. The observers
were trained in nine category codes of disruptive behavior. Group I was
then given the expectation that soft reprimands from the teacher would increase
the rate of disruptive behavior. Group II was told that soft reprimands
would decrease disruptive behavior. And Group III was given no expectation
at all about the effects of soft reprimands. Rationales were
given each group explaining the reasons for each specific expectation. The
effects of these expectations were assessed by having the observers watch 
four days of baseline and five days of treatment data. The interaction 
between the mean rate of disruptive behavior in the three conditions and 
the two treatment conditions was significant at the .005 level, indicating 
the presence of observer bias. Ronald Kent (1972) has
suggested that these reported effects of expectation bias were confounded
with observer drift in the accuracy of recording. When different groups of
raters, who are interreliable within groups, fail to frequently compute
agreement between groups, they may "drift" apart in their application of
the behavioral code. However, it should be noted that when this drift,
comprised of recording errors, is aligned asymmetrically in the direction
of the expectation, then the drift is, by definition, observer bias. 
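
Kent's point about drift between groups suggests a simple check, sketched
below under our own assumptions: compute agreement among observers within
each expectation group and compare it with agreement across groups on the
same taped material. The function names, the single-letter codes, and the
data are hypothetical, and percent agreement is assumed to be the share of
intervals coded identically.

```python
from itertools import combinations, product
from statistics import mean


def percent_agreement(codes_a, codes_b):
    """Percent of intervals on which two protocols carry the same code."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("protocols must be equal-length and non-empty")
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return 100.0 * matches / len(codes_a)


def within_group_agreement(group):
    """Mean agreement over all pairs of observers in one group."""
    pairs = list(combinations(group, 2))
    return mean([percent_agreement(a, b) for a, b in pairs])


def between_group_agreement(group_1, group_2):
    """Mean agreement over all cross-group pairs of observers."""
    pairs = list(product(group_1, group_2))
    return mean([percent_agreement(a, b) for a, b in pairs])


# Hypothetical protocols for the same four taped intervals
# ("D" = disruptive, "A" = appropriate).
increase_group = [list("DDAA"), list("DDAA")]
decrease_group = [list("DAAA"), list("DAAA")]

within_increase = within_group_agreement(increase_group)
within_decrease = within_group_agreement(decrease_group)
between_groups = between_group_agreement(increase_group, decrease_group)
```

In this contrived example each group is perfectly consistent internally,
yet the groups agree with each other on only three of four intervals.
Within-group agreement alone would not reveal the divergence, which is why
between-group checks are needed to detect drift of this kind.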

Skindrud (1972) attempted to replicate the findings of Kass and O'Leary 
(1970). Observers were divided into three groups, each group given a different 
expectation about video-taped family interactions. The first group was 
given the expectation that when the father was absent there would be more 
child deviant behaviors than when the father was present. A second group
was given the opposite expectation. Appropriate rationales were provided 
for each of these two groups. An additional control group was added with 
no expectations provided regarding father-present or father-absent tapes.
All observers were checked at the end of training on the rates of deviant 
behaviors they recorded and subsequently matched on this variable when 
assigned to conditions. Throughout the study, observer agreement data was 
collected randomly. During training, reliability was checked daily, and the 
average observer agreement prior to the beginning of the manipulation was 64%.
The results of the study gave no evidence for observer bias. There were no
significant differences between groups and no significant interaction effects.
There was little drift in the accuracy with which the code was used. Sequential
reliabilities were computed for the increase, decrease, and control
groups with average observer accuracy of 58.5%, 57.6%, and 58.4%, respectively.
These accuracy figures were computed by comparisons with previously
coded criterion protocols. The relatively small and consistent
decline in accuracy is consistent with the failure to find bias.

A similar unsuccessful attempt to replicate Kass and O'Leary (1970)
was reported by Kent (1972). Kent found that knowledge of predicted results
was not sufficient to produce an observer bias effect. However, when
the experimenter reacted positively to data which was consistent with the
given predictions and negatively to inconsistent data, a significant observer
bias effect was obtained.

The available literature dealing with observer bias in naturalistic
observation is both sparse and contradictory. Furthermore, the few studies
available have focused exclusively on only one source of observer bias,
namely, recording errors or errors of apprehension. Thus far, no one has
systematically investigated the effects of the observer's expectancies on
the subjects' behavior. In the three studies reported above, all observations
were made from video-taped recordings. There were no opportunities
for the observers to communicate their expectancies to their subjects.
Yet, in most studies employing naturalistic observational procedures, 
observers do have that opportunity. 

An important study which needs to be conducted is one which examines 
the observer's expectancy effects on the subject. First, it would be 
interesting to determine if observers could nonverbally communicate their
expectancies to subjects such that the subject's behavior changes in the
direction of the expectancy. The next step, of course, would be to repli- 
cate this same design without specifically asking observers to attempt to 
influence subjects, but merely to give them an expectation. 

Perhaps the most important test of observer bias effects will be the 
one which combines recording errors and effects of observer expectancy on 
subjects in the naturalistic setting. One can question the generalizability
of highly controlled laboratory studies to live observations and to research
projects in which the observers are more invested in the outcome of the
research. The generalizability of studies which employ only taped versions
of a subject's behavior is further limited by excluding the possible effects
of an observer's expectancy on his subject's behavior.

Another variable which seems crucial to observer bias in the naturalistic
setting is the observer's responsiveness to admonitions to remain
scientific, objective, and impartial in the collection of data. Rosenthal
(1966) stresses the importance of the experimenter-observer's identification
with science and objectivity. He cites evidence suggesting that graduate
students obtain less biased data than undergraduates and interprets this
difference as a function of identification with science. Perhaps observers 
who are repeatedly reminded to be impartial might be less susceptible to the 
influence of biasing information than observers not given these admonitions. 

A dimension which seems important in considering observer bias is the 
specificity of the code. In most of the Rosenthal literature, the dependent 
variable is scaled between such global poles as success and failure.
Intuitively, it seems logical that the more ambiguous the dependent measure,
the greater the possibility for bias. A multivariate coding system with
well-defined behavioral codes might be expected to restrict interpretive
bias. This is an empirical question worthy of examination. 

Another variable which might greatly affect observer bias is observer 
agreement. The greater the observer agreement, the less likely is observer 
bias, even among observers with the same expectancy. 

Until more information is available on observer bias effects in naturalistic
observation, it seems critical to do everything possible to minimize the
potential for these effects. Whenever possible, observers should not have
access to information that may give rise to confounding consequences and should
be encouraged to reveal the nature and source of any information they do
receive. In our research, we are currently observing both families in
clinical treatment and "normal" or nontreated families. Knowledge of a
family's status might seriously affect the observer's data. Also, knowledge
about treatment stages (baseline, mid-treatment, post-treatment, and follow-up)
might affect the observers' data. After each observation, it is our
policy to have observers fill out a questionnaire concerning the nature and
source of any biasing information. Thus far, of 75 observations of referred
families, observers have considered themselves informed only 36% of the time.
And, in all of these cases, their information was correct. This information
usually comes from a member of the family being observed (56%). Other sources
of information include information leaks from the therapists (12%), the
Child Study Center Clinic generally (16%), and other sources (16%). Of the
observers considering themselves informed as to the clinic vs. "normal"
status of the families, 29% also considered themselves informed as to treatment
stage, but only two-thirds of these observers were correct in their
discrimination. In only 20% of the cases did the observer actually know
the status of the case (i.e., clinic vs. normal) and the treatment stage
(baseline vs. after baseline). Of the observers considering themselves 
completely uninformed of the families' status, their guessing rate (clinical 
or "normal") barely exceeded chance at 51%. Their guesses as to the four 
stages of treatment were 36% correct and 80% correct on the discrimination 
between baseline and after baseline. 

Of the "normal" families seen, observers have considered themselves
informed as to family status only 17% of the time. However, in only 45% of 
these cases were the observers actually correct in making the discrimination. 
In the uninformed observations, however, observers were able to guess the 
family's status correctly 75% of the time. 
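
The questionnaire figures reported above are simple proportions. As a rough illustration only, tallies of this kind might be computed as in the following sketch. The record fields (`informed`, `guess`, `actual`), the function name, and the sample records are hypothetical; they are not the authors' data or procedure.

```python
def summarize_questionnaires(records):
    """Summarize post-observation questionnaires.

    Each record is a dict with hypothetical keys:
      informed -- True if the observer considered himself informed
      guess    -- the family status the observer reported or guessed
      actual   -- the family's true status
    """
    def accuracy(group):
        # Proportion of records whose reported status matches the true one.
        if not group:
            return None
        return sum(r["guess"] == r["actual"] for r in group) / len(group)

    informed = [r for r in records if r["informed"]]
    uninformed = [r for r in records if not r["informed"]]
    return {
        "proportion_informed": len(informed) / len(records),
        "accuracy_informed": accuracy(informed),
        "accuracy_uninformed": accuracy(uninformed),
    }


# Invented records, for illustration only.
records = [
    {"informed": True, "guess": "clinic", "actual": "clinic"},
    {"informed": False, "guess": "clinic", "actual": "normal"},
    {"informed": False, "guess": "normal", "actual": "normal"},
    {"informed": False, "guess": "clinic", "actual": "clinic"},
]
summary = summarize_questionnaires(records)
print(summary)
```

Keeping the informed and uninformed records separate mirrors the distinction drawn in the text between observers who believed they knew a family's status and those who were guessing.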

Not only are these questionnaires beneficial in gauging the amount of
potentially biasing information that observers discover, but they are helpful
in two other ways as well. First, by revealing sources of information
leakage, steps can be taken to eliminate those sources. Second, questionnaires,
given after each family is observed, serve as a regular reminder
of the importance of unbiased, objective recording of behavior.

It is difficult to make any firm conclusions about the presence or
absence of observer bias in naturalistic observation. Clearly, more research 
is needed on this question. However, it should also be clear that the poten- 
tially confounding influence of observer bias cannot be ignored and that 
steps can and should be taken to minimize its possible effect. 

The Issue of Reactivity in Naturalistic Observation 

In the previous section, we have considered the effects of an observer's
bias in naturalistic observation. In this section, we will discuss the
effect of the observer's presence on the subjects being observed. Whereas
observer bias can potentially invalidate comparisons by confounding
influences, the reactive effects to being observed primarily constitute a
threat to the generalizability of the findings. That is, subjects' observed
behavior in the natural setting may not generalize to their unobserved
behavior. Webb, Campbell, Schwartz, and Sechrest (1966) have defined
reactivity in terms of measurement procedures which influence and thereby
change the behavior of the subject. Weick (1968) has also referred to
reactivity as "interference" or the intrusiveness of the observer himself
upon the behavior being observed. Clearly, situations which are highly
reactive in terms of "observer effects" are not likely to be generalizable
to situations in which such effects are absent.

Reactive effects have been studied with two basic paradigms: a) by
the study of behavioral stability over time and b) by comparison of the
effects of various levels of obtrusiveness in the observation procedure.
In employing the first method, investigators have typically examined
behavioral data for change over time in the median level and variance of the
dependent variable. In general, it has been assumed that change reflects
initial reactivity and progressive adaptation to being observed. This
interpretation is particularly persuasive if there is an obvious stability in
the data after some initial period of change or high variability. While
this is a viable way of checking for reactivity effects, it is a highly
indirect method and relies on assumptions concerning the causes of observed
change. It is obvious that other processes could account for such change.
Furthermore, the lack of change certainly does not indicate a lack of reactive
effects. The second method, comparing obtrusive levels of observation,
appears less inferential than the first method. The problem with this method
is that it only provides a picture of relative degrees of reactivity between
obtrusiveness levels; it does not provide a measure of the degree of
reactivity relative to the true, unobserved behavior. However, this problem can
be remedied if one of the observational treatments in the comparison is
totally unobtrusive or concealed.
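
The two paradigms can be made concrete with a small sketch. It assumes the data are simple lists of behavior rates; the function names, condition labels, and numbers are hypothetical and are offered only as an illustration, not as a procedure used in any of the studies reviewed.

```python
from statistics import mean, median, variance


def stability_over_time(sessions):
    """Paradigm (a): median level and variance of the dependent
    variable for each successive observation session."""
    return [(median(s), variance(s)) for s in sessions]


def reactivity_relative_to(baseline, conditions):
    """Paradigm (b): mean difference between each obtrusiveness level
    and a baseline condition. Only when the baseline is a concealed
    (totally unobtrusive) condition does the difference estimate
    reactivity relative to unobserved behavior."""
    base = mean(baseline)
    return {name: mean(scores) - base for name, scores in conditions.items()}


# Invented rates of some behavior, for illustration only.
sessions = [[2, 9, 4], [5, 6, 7], [5, 6, 7]]
concealed = [4, 6]
levels = {"observer present": [8, 10], "tape recorder only": [5, 7]}

print(stability_over_time(sessions))
print(reactivity_relative_to(concealed, levels))
```

Note that the first function only describes change; as the text cautions, attributing that change to adaptation is an inference the numbers alone cannot support.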

To what extent does reactivity occur in naturalistic observation? The
literature addressing this question is commonly reported in reviews to be
contradictory (Wiggins, 1970; Weick, 1968; Patterson & Harris, 1968).
Several studies have been cited as providing evidence for the position that
reactive effects may be quite minimal. Others have been cited which suggest
that reactive effects are quite pronounced. The purpose of this review is
to: a) reconsider the contradictions in the literature on reactivity, b)
tease out those factors which seem to account for reactivity, and c) propose
further investigations which isolate these factors.

In a number of reviews on reactivity, several studies have been
consistently cited which support the position that reactivity does not
constitute a major threat to generalizability. One study frequently cited is the
timely investigation of a Midwest community by Barker and Wright (1955).
In this admirable study, careful naturalistic observations were made of
children under ten years of age and their daily interactions with peers
and parents. The authors assumed that reactive effects were short lived
and that the adults and other members of the families quickly habituated
to the presence of the observers. In addition, it was reported that,
with the younger subjects in the sample, reactive effects were slight.
However, these findings should be interpreted with much caution. What
is easily lost sight of in the summaries of this work is that the observers
in this study were free to interact with the subjects in a friendly but
nondirective manner. In fact, the basis for the authors' conclusion
that reactive effects were not pronounced was the
finding that "only" 20% of the children's behavioral interactions were with
the observer. Allowing the observer to interact with the subject must
certainly have increased the intrusiveness of the observer and provided the
opportunity for the observer to influence the subject's behavior. The
authors' other conclusion, that reactivity, as measured by frequency of
interactions, positively correlated with age, is also suspect in that children
below the age of five were not always informed that they were being observed,
whereas children above this age were.

Another study commonly cited in support of the minimal reactivity
position is that of Bales (1950). In this controlled laboratory investigation,
the behavior of a discussion group was not found to be changed by three
levels of observer conspicuousness. This finding, however, may be limited to
the laboratory setting.

Two additional studies, identified as supportive of the minimal
reactivity argument, made use of radio transmitter recording in the
naturalistic environment. Soskin and John (1963) had a married couple wear
a transmitter the entire time they were on a two-week vacation. Purcell
and Brady (1965) outfitted adolescents in a treatment center with a similar
recording device for one hour a day. When the protocols in both studies were
examined for the frequency of comments about being observed or listened to,
it was found that such references declined to a zero level either during the
first or second day of recording. This is not to say, of course, that these
subjects were not still aware of, and affected by, the recording device; the
results only indicate that the subjects talked about the device less after
the first day.

A recent investigation by Martin, Gelfand, and Hartmann (1971) can
also be interpreted as providing evidence for low levels of reactivity to
observation. This study involved 100 elementary school children, ages 5-7.
Equal numbers of male and female subjects were assigned to five observation
conditions following exposure to an aggressive model: a) observer absent,
b) female adult observer present, c) male adult observer present, d)
female peer observer present, and e) male peer observer present. During
the free-play session, the subjects' aggressive behavior was recorded by
observers behind a one-way mirror. No significant differences in aggressive
behaviors were obtained between the observer-present and observer-absent
conditions. The absence of differences between these two levels of
obtrusiveness in observation suggests little or no reactivity to the presence of
an observer. Within the observer-present conditions, however, it was found
that peer observers significantly facilitated imitative aggressive responding
in both boys and girls compared to adult observers. Also, there was more
imitative aggression when the observer was the same sex as the subject. The
girls, but not the boys, showed significant increases in aggressive output
over time when the observer was present but not when the observer was absent.
This latter finding suggests that girls manifest initial reactivity to
the presence of an observer but later habituate to the observer's presence.
It is interesting that both paradigms for measuring reactivity were used in
this investigation and that each method supports different conclusions about
the degree of reactivity. In considering the generalizability of these 
findings to naturalistic observation procedures, it should be noted that 
observers in this study were instructed to not pay attention to the subjects 
and were either seated facing away from the subjects (adult observers) or 
given a coloring task to complete (peer observers). With naturalistic ob- 
servation procedures, on the other hand, observers typically pay very close 
attention to their subjects. 


The studies cited as evidence for the minimal reactivity position have been
based on data that are questionable or restricted to highly specific
circumstances (e.g., Bales, 1950; Martin et al., 1971).

Many other studies have been cited as demonstrating considerable reactive
effects of observation in natural settings. One such study is that of
Polansky, Freeman, Horowitz, Irwin, Papanis, Rappaport, and Whaley
(1949). These investigators observed delinquent children in a study of
group emotional contagion phenomena. Children were informed that the
observers were studying their reactions to various aspects of the summer-camp
program. The authors report that during the first week of observations, the
children essentially ignored the observers. On the other hand, during the
second week, many "blow-ups" occurred which were directed against the
observers, especially by the older children. The authors postulate that this
aggressiveness of the children can be explained as a resistance to being
observed. They also concede that this "resistance" hypothesis was
confounded by a "second week" effect, which they describe as an
increasing anti-adult aggressiveness that evolves after the children
have adjusted to the camp, presumably by the second week. It is unclear as
to what to conclude from this study about reactivity. Was reactivity most
prevalent when children were aggressive with the observers in the second
week? Or, was reactivity the most intense during the first week when the
delinquent children were "suppressing" aggression prior to habituating to
the unfamiliar environment? And, more importantly, how much of this pattern
of cooperation in the first week and anti-adult aggressiveness in the second
would have occurred in the absence of observers? This question is left
unanswered by the Polansky et al. (1949) study.

A more sophisticated study demonstrating considerable reactive effects
is that of Roberts and Renzaglia (1965). In this study, eight therapists
saw two "clients" (students solicited from an introductory psychology course)
for three sessions. The three conditions were: a) with a tape recorder in
the room, b) microphone only, c) with a concealed microphone and a concealed
recorder. While no differences were found as to the number of minutes that
subjects talked in each condition, it was found that the clients made more
favorable self-references when they knew they were being recorded (conditions
a and b), and more unfavorable self-references when they did not realize
they were being recorded (condition c). While the generalizability of this
study is limited by the small n and structured situation employed, it does
provide evidence for reactivity effects.

The Bechtel (1967) study has also been presented in reviews as a study
demonstrating considerable reactive effects. Bechtel studied the movements
of people in an art museum with an automated recording device, called a
hodometer. In one condition, subjects were requested to go through the art
exhibit in one room and rank the prints according to preference. Another
group was given the same instructions and also informed that they were being
observed. For the first group, subjects were found to leave the room with
an average latency of 313 seconds, covering an average area of T6 square feet.
The second group left after 220 seconds and covered only '^8 square feet.
Bechtel concludes that being observed constituted an aversive situation which
led to escaping the room faster. No statistics were provided to determine
whether or not the reported averages were significantly different, nor was
the number of subjects in each condition reported. Unfortunately, this kind
of study does not provide any information about habituation effects, since
subjects were allowed to leave the setting.

ment. This article is the only study available which was designed
specifically to manipulate and measure observer effects in the homes of the
families observed. In this study, data obtained from mothers on their own
families were compared with the data on the same family collected by an
outside observer. There were three conditions, with five families per
condition: a) mothers collected the first five ten-minute sessions of
observational data and an outside observer collected the second five sessions
of data on the child and father only (M-O), b) the observer collected all
ten sessions as a test for habituation effects (O-O), and c) the mothers
collected all ten sessions as a control for habituation effects (M-M). The
dependent variables were the rate of total behaviors and the rate of deviant
behaviors. A problem in the research design of this study should be noted.
The mother was present in the family as a participant in the second condition
(O-O) and the second half of condition a (M-O), but was not a participant
when she was an observer in condition c and the first half of condition a.
These comparisons are confounded by mother presence and absence. In spite
of this confound, which would probably bias in favor of showing group
differences, no main effects for groups were found in analysis of variance for
either the rate of total interactions or deviant behaviors. Thus, on the
initially selected dependent variables, no reactive effects were apparent.

Patterson and Harris also divided their groups into high and low rate
interactors on the basis of the first five sessions. On the frequency of
total interactions measure, high rate interactors in the first five sessions
showed significant reductions in rate during the last five sessions. The
authors describe this decline as a "structuring effect" in that the subjects
appeared to program some activity together in the first five sessions.
Conversely, the low rate interactors in the first five sessions showed slight
increases in rates during the last five sessions. The authors describe this
transition as an habituation effect in that subjects initially involved
themselves in solitary activities or attempted to escape the observational
situation but later adjusted to it and interacted more. In general, there
were no changes in deviant behavior from the first set of five observations
to the last set of five. The only significant finding was that subjects who
displayed low rates of deviant behavior in the first five sessions (under
the M-O condition) increased their rate in the last five sessions. However,
it is possible that the mothers were recording fewer deviant behaviors and
more positive behaviors in the first five sessions than were the observers
in the second five sessions, thus contributing differentially to main trials
effects. An observational study by Rosenthal (1966) supports such a thesis.
He found that parents tended to code more positive changes in their children
than were actually present. And, Peine (1970) found that parents were less
observant of their children's deviant behaviors than were nonparent observers.

Patterson and Harris conclude that "generalization about 'observer
effects' should probably be limited to special classes of behavior" (p. 16).
A more recent study by Patterson and Cobb (1971) analyzed the stability of
each of the 29 behavior codes used in their coding system. If it is assumed
that individuals adapt to the presence of an observer over time, then a
repeated measures analysis of variance should reveal differences in the mean
level of various behaviors. Patterson and Cobb analyzed data for 31 children
from problem and nonproblem families over seven baseline sessions. None
of the changes in mean level for the codes produced a significant effect
over time. The investigators conclude that the observation data were
fairly stable for most code categories. It is possible, of course, that had
observations continued over a longer period of time, significant changes in 
mean level for some behaviors would have been discovered. Given that fam- 
ilies were rarely observed on consecutive days by the same observer, it is 
possible that different observers could have resensitized the families each 
day, thereby extending the period required for adaptation. 
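
The logic of such a stability test can be sketched for a single behavior code. The sketch below is a generic one-way repeated measures analysis of variance (subjects by sessions), written from the textbook formulas; it is not Patterson and Cobb's analysis, and the function name and the small data set are invented for illustration.

```python
def repeated_measures_f(scores):
    """One-way repeated measures ANOVA.

    scores[i][j] is the rate of one behavior code for subject i in
    session j. Returns (F, df_sessions, df_error) for the sessions effect.
    """
    n = len(scores)        # number of subjects
    k = len(scores[0])     # number of sessions
    grand = sum(sum(row) for row in scores) / (n * k)

    subject_means = [sum(row) / k for row in scores]
    session_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_subjects = k * sum((m - grand) ** 2 for m in subject_means)
    ss_sessions = n * sum((m - grand) ** 2 for m in session_means)
    # Residual (subject x session) variation is the error term.
    ss_error = ss_total - ss_subjects - ss_sessions

    df_sessions = k - 1
    df_error = (n - 1) * (k - 1)
    f_ratio = (ss_sessions / df_sessions) / (ss_error / df_error)
    return f_ratio, df_sessions, df_error


# Invented data: three subjects observed over three sessions.
data = [[1, 2, 4], [2, 3, 4], [3, 5, 5]]
f_ratio, df_sessions, df_error = repeated_measures_f(data)
print(f_ratio, df_sessions, df_error)
```

A nonsignificant F for sessions indicates only that mean level did not shift detectably across the sessions sampled; it does not establish that no reactive effect was present.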

In summary, there are a few well-designed studies which have discovered 
reactive effects (e.g., Roberts and Renzaglia, 1965; Bechtel, 1967; White, 
1972), but there are several others where the meaning of the results is 
unclear. There can be little doubt that the entire question has been
inadequately researched. Any general conclusions about the extent of
reactivity in naturalistic observation would seem premature at this time.

As White (1972) points out, the finding of reactive effects seems to 
depend on many factors, including the setting (e.g., home, school, labor- 
atory), the length of observation, and the constraints placed on subjects 
by the conditions of observation (e.g., no television during observations, 
remain within two adjacent rooms, etc.). Furthermore, it should be realized 
that reactivity may or may not be discovered depending upon what paradigm of 
measurement is used (e.g., Patterson & Harris, 1968; Martin et al., 1971)
and what variables are analyzed as dependent variables (e.g., Roberts &
Renzaglia, 1965; White, 1972). Unless these factors are controlled for in 
comparing experiments on reactivity, both contradictions and consistencies 
as to the relative presence or absence of reactivity may falsely appear. 

Assuming that reactivity to being observed in naturalistic settings 
does occur, even if only to some minimal degree, the critical task is to 
localize the sources of interference so that they can be dealt with more 
directly. Four such sources will be discussed and experiments will be
proposed to measure the extent of their intrusiveness.

Factor 1: Conspicuousness of the Observer

The literature points to the level of conspicuousness or intrusiveness 
of the observer as an important factor contributing to reactivity.
Presumably, the more novel and conspicuous the agent of observation, the more
distracting are the effects upon the individuals being observed. It would
also follow that longer habituation periods would be required for more dis- 
tracting observational agents in order to achieve stability of data. 

Bernal, Gibson, Williams, and Pesses (1971) compared two observation
procedures which would presumably vary on obtrusiveness. These investigators
compared data collected by an observer with that collected by means of an
audio tape recorder which was switched on by an automatic timing device. The
family members involved in this study were aware of the presence of the
recorder but were unaware of the exact time of its operation. The primary
purpose of this study was to explore the feasibility of the audio tape
method and to explore the relationship of data collected by the two methods
rather than to study reactivity per se. The results indicated that, during
the same time interval, there was a high relationship between the mother's
command rate as coded by the observer and from the tape (r = .86) but that
the observer coded more commands. Similar results were obtained when the
observer's data was compared with data based on coding of the audio tapes
from different time intervals. The question arises as to how much of this
latter discrepancy was due to differences in levels of reactivity and how
much was due to differences associated with the source of coding. The
authors point out, for example, that the observer could code gestural
commands while the coder using the tape could not. Since the discrepancies
at the same time and at different times were of the same general order of
magnitude, it is likely that most of the observed difference across time
was due to the material on which coding was based rather than to differences
in subject reactivity. To study the impact of reactivity effects separately,
one might design such a study so that the same stimulus materials would be
used for coding.
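
The pattern described here, a high correlation together with a consistent difference in level, is easy to overlook, since a correlation coefficient is insensitive to a constant offset between two sets of scores. The following sketch illustrates the point with invented command counts; the numbers and function names are hypothetical and bear no relation to the Bernal et al. data.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between paired scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mean_difference(x, y):
    """Average amount by which the first source exceeds the second."""
    return sum(a - b for a, b in zip(x, y)) / len(x)


# Invented command counts per session, for illustration only.
observer_counts = [12, 15, 9, 20, 14]
tape_counts = [10, 13, 7, 18, 12]   # always two fewer than the observer

print(pearson_r(observer_counts, tape_counts))
print(mean_difference(observer_counts, tape_counts))
```

Reporting both statistics, as Bernal et al. did, separates the question of whether the two methods order sessions alike from the question of whether one method systematically records more behavior.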

We are currently completing a study on reactivity which employs this 
strategy to compare reactivity associated with an observer present in the
home carrying a tape recorder vs. the tape recorder alone. This study 
involves six days of observation for ^5 minutes per day with single-child 
families. The two conditions are alternated so that the observer is present 
one evening and not present the next. The observer is actually a "bogus" 
observer. All behavioral coding is done on the basis of the tapes. It is 
our suspicion that reactivity to the tape recorder will be short lived and 
minimal compared to the reactivity associated with the observer present. 

If these hypotheses are substantiated in this and other research, 
alternatives to having an observer present in the home should be explored. 
One solution to be seriously considered would be extended use of portable
video or audio tape recording equipment. These recording devices could
remain in the homes over an extended observation period to facilitate
habituation effects. In addition, the devices could be preprogrammed to turn
on and off at different times during the day so that the observed would not
know when they are in operation (as in Bernal et al., 1971). This solution,
which would, of course, require full knowledge and consent of the parties
involved, appears to be a promising one for attenuating reactivity effects
as well as solving problems of observer bias. 

Factor 2: Individual Differences of the Subjects

Some people might be expected to manifest more reactivity to the presence
of an observer than others. A "personality" variable such as guardedness 
might be correlated with degree of reactivity. For example, scores on the K 
scale of the MMPI (or other comparable tests) might be related to the effects 
of being observed in a natural setting. 

The literature also suggests that age is correlated with reactivity. 
Several authors (Barker & Wright, 1955; Polansky et al., 1949) have suggested
that younger children are less self-conscious and thereby less subject to
reactive effects than older children. The Martin et al. (1971) study also
suggests that sex might be an important factor accounting for different
levels of reactivity. Experiments are needed which compare these individual
difference variables in the natural setting with naturalistic observation
procedures.

Factor 3: Personal Attributes of the Observer

Evidence from semi-structured interviews suggests that reactive effects
may also be contributed by the unique attributes of the observer. Different
attributes of the observer may elicit different roles on the part of the
subject, depending upon what might be appropriate given the observer's
attribute. Rosenthal (1966) reports several such attributes that have been
demonstrated to yield differential effects, including the age of the observer,
sex, race, socio-economic class, and the observer's professional status (i.e.,
undergraduate observer vs. Ph.D. therapist). Martin et al. (1971) also
discovered that both the factors of age and sex of the observer had differential
effects on the subjects being observed. Varying any of these dimensions 
parametrically would be relatively simple in investigating this problem in 
the natural setting. 

Factor 4: Rationale for Observation

Another factor that may be important in accounting for reactivity is 
the amount of rationale given subjects for being observed. Whereas the 
Bales (1950) study found no differential reactivity of three levels of observer
conspicuousness in a group-discussion setting, Smith (1957) found that
nonparticipant observers aroused hostility and uncertainty among partici- 
pating group members. Weick (1968) suggests that this discrepancy may have 
been a function of different amounts of rationale for the presence of an 
observer. We hypothesize that a thorough rationale for being observed might 
be expected to reduce guardedness, anxiety, etc., and thereby reduce the 
reactivity. 

Observer reactivity is a problem that cannot be easily dismissed for 
naturalistic observation. There is sufficient evidence to suggest that
observer reactivity can seriously limit the generalizability of naturalistic
observation data. Clearly, factors accounting for reactivity need to be
investigated and solutions derived to minimize the effects of the observer
on the observed. In the next section, we will describe how reactivity, in
addition to posing a problem for generalizability, can also interact with
and confound the dependent variable. 

Observee Bias:
Demand Characteristics, Response Sets and Fakability

Reactivity to observation will always be a problem for naturalistic
research, but it would be a relatively manageable one if we could assume it
to be a relatively constant, noninteractive effect. That is, if we knew
that the presence of an observer reliably reduced activity level or deviant
behavior by 30%, for example, the problem would not be too damaging to
research investigations involving groups of subjects. But, what if the
observee's reactivity to being observed interacts with the dependent variable
under study?

Let us take the example of a treatment study on deviant children in which
observations are taken prior to and after treatment. Prior to treatment, the
appropriate thing for involved parents or teachers to do is to make their
referred child appear to be deviant in order to justify treatment. The
appropriate response at the end of treatment, on the other hand, is to make
the child appear improved in order to justify the termination, please the
therapist, etc. These are the demand characteristics of the situation. In
this case, the reactivity to being observed is not constant or unidirectional,
but interacts with and confounds the dependent variable. It is possible that
any improvement we see in the children's behavior is simply the result of
differential reactivity as a consequence of the demand characteristics of the
situation. Now, let us suppose we employ a wait list control group and
collect observational data twice before beginning treatment and at the same
interval as used for the treated group. This procedure provides an excellent
pretest-post-test control for our treated group. But, what of the demand
characteristics of this procedure? On the first assessment, the involved
parents or teachers will probably behave in the same general way as their
counterparts in the treated group, but by the second observation they may
be more desperate for help and even more concerned to present their child
as highly deviant. Thus, simply as a result of the demand characteristics
involved, we might expect our treatment group to show improvement while the
control group would show some deterioration.
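
The contrast between constant and interactive reactivity can be made concrete with a small arithmetic sketch. Everything in it is an illustrative assumption rather than data from any study: the true rate of 10, the 30% suppression factor borrowed from the example above, and the inflation and deflation factors of 1.2 and 0.8 standing in for "fake bad" and "fake good" pressures. The function name `observed_change` is likewise hypothetical.

```python
def observed_change(true_pre, true_post, factor_pre, factor_post):
    """Observed post-minus-pre change when each true rate is multiplied
    by the reactivity factor operating on that occasion."""
    return true_post * factor_post - true_pre * factor_pre


# Hypothetical true rate of deviant behavior, unchanged by treatment.
TRUE_RATE = 10.0

# Constant reactivity: the observer suppresses behavior by 30% on both
# occasions, so no spurious change appears.
constant_case = observed_change(TRUE_RATE, TRUE_RATE, 0.7, 0.7)

# Demand characteristics: behavior is inflated before treatment and
# deflated afterward, so an apparent improvement appears with no true change.
demand_case = observed_change(TRUE_RATE, TRUE_RATE, 1.2, 0.8)

print("constant reactivity:", constant_case)
print("demand characteristics:", demand_case)
```

Under the constant factor a real treatment effect would be shrunk proportionally but would keep its direction; under the occasion-dependent factors the observed change no longer tells us anything certain about the true change.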

We also may wish to compare our referred children with children who
are presumably "normal" or at least not referred for psychological treatment.
Once again, however, we might anticipate that parents recruited for
"normative" research on "typical" families would be more inclined than our
parents of referred children to present their wards as nondeviant or good. In
other words, a response set of social desirability could be operative with
this sample, making them less directly comparable to the referred sample.

These arguments would, of course, be even more persuasive if we were
dealing with the observed behavior of the adults themselves. The foregoing
observations on children assume, however, that the involved adults are
capable of influencing children to appear relatively "deviant" or "normal"
if they wish to do so (i.e., that observational data on children is
potentially fakable by adult manipulation).

We have just completed a study (Johnson & Lobitz, 1972) which was
directed at testing this assumption. Twelve sets of parents with four- or
five-year-old children were instructed to do everything in their power to
make their children look "bad" or "deviant" on three days of a six-day home
observation and to make their children look "good" or "nondeviant" on the
remaining three days. Parents alternated from "good" to "bad" days in a
counterbalanced design.

Four predictions were made regarding the behavior of both children
and parents. During the "fake bad" periods, it was anticipated that,
relative to the "fake good" periods, there would be:

a) more deviant child behaviors,
b) a lower ratio of compliance to parental commands,
c) more "negative" responses on the part of parents, and
d) more parental commands.

Predictions a, c, and d were confirmed at or beyond the .01 level of
confidence. Only the child's compliance ratio failed to be responsive to
the manipulation. It will be recalled from the section on reliability that
this statistic is by far the least reliable and thus the least sensitive
(statistically) to manipulation. These results, which demonstrate the
fakability of naturalistic behavioral data, indicate that this kind of data
may potentially be confounded by demand characteristics and/or response sets.

We are aware of only one other study involving naturalistic observation
which helps demonstrate this problem (Horton, Larson, & Maser, 1972). This
study involved one teacher who was under the instruction of a "master"
teacher for the purpose of raising her classroom approval behavior. She was
observed, without her knowledge, by students in the class. The results
clearly showed that her approval behavior was at a much higher rate when
she was being observed by the "master" teacher than when she was not being
observed. Generalization from overtly observed periods to periods of
covert observation was very minimal indeed. More generalization was found
when the "master" teacher's presence in the classroom was put on a more
random schedule. This study is not completely analogous to most naturalistic
research because, in this case, the observer and trainer were the same person
and the study is limited in generalizability because of the N = 1 design.
Yet, in most cases, the observed are aware that the collected observational
data will be seen by the involved therapist, teacher, or researcher, and
if the problem exists for one subject, it is a potential problem for all
subjects. Observee bias is really a special case of subject reactivity to
observation. Thus, the potential solutions outlined in the previous section
apply here as well. In general, we suspect that observation procedures which
are relatively unobtrusive and which allow for relatively long periods of
adaptation will yield less reactivity and observee bias.

Just as behaviorists have ignored the requirement of classical
reliability in their data, they have also neglected to give any systematic
attention to the concept of validity. Most research investigations in the
behavior modification literature which have employed observational methods
have relied on behavior sampling in only one narrowly circumscribed
situation with no evidence that the observed behavior was representative of
the subject's action in other stimulus situations. In addition, behaviorists
have largely failed to show that the obtained scores on behavioral dimensions
bear any relationship to scores obtained on the same dimensions by different
measurement procedures. This fact calls into serious question the validity
of any of this research where the purpose has been to generalize beyond the
peculiar circumstances of the narrowly defined assessment situation. Of
course, the methodological problems we have presented thus far all pose
threats to the validity of the behavioral scores obtained. But, we would
argue that even if all these problems could somehow be magically solved, the
requirement for some form of convergent validity would still be essential.

As with reliability, there are many different methods of validation, but as
Campbell and Fiske (1959) point out:

Validation is typically convergent; a confirmation by independent
measurement procedures. Independence of methods is a common
denominator among the major types of validity (excepting content
validity) insofar as they are to be distinguished from reliability.
. . . Reliability is the agreement between two efforts to
measure the same trait through maximally similar methods.
Validity is represented in the agreement between two attempts to
measure the same trait through maximally different methods.

Thus, convergent validity is established when two dissimilar methods of
measuring the same variable yield similar or correlated results. Predictive
validity is established when the measure of a behavioral dimension correlates
with a criterion established by a dissimilar measurement instrument.

With only a few exceptions, behaviorists have restricted themselves to
face or content validity. And, of course, it must be admitted that the
face validity of narrowly-defined behavioral variables is often quite
persuasive. This is particularly true in cases where the behavioral dimension
under study has very narrow breadth or "band width." After all, a
behaviorist might argue, what can be a more valid measure of the rate of a
child's hitting in the classroom than a straightforward, accurate count of
that hitting? While this argument is persuasive, two counter arguments must be
considered. First, because of all of the methodological problems which we
have presented thus far, we can never be certain that the observed rates
during a limited observation period are completely valid or generalizable
even to very similar stimulus situations. While many of the problems we
have outlined can be solved and others attenuated, it is unlikely that all
will ever be completely eliminated. Second, is it not still of consequence
to know whether our behavior rate estimates have any relationship to other
important and logically related external variables? Is it not important,
for example, to know whether or not the teacher and classmates of an observed
high-rate hitter perceive this child as a hitter? It does seem important to
us, particularly for practical clinical purposes, since we know that people's
perceptions of others' behavior often have more to do with the way they
treat them than does the subject's actual behavior. The need for
establishing some form of convergent validation becomes even more profound as
the behavioral dimensions we deal with increase in band width. As we begin to
talk about such broad categories as appropriate vs. inappropriate behavior
(e.g., Gelfand, Gelfand, & Dobson, 1967), deviant vs. nondeviant behaviors
in children (e.g., Patterson, Ray, & Shaw, 1969; Johnson et al., 1972), or
friendly vs. unfriendly behaviors (e.g., Raush, 1965), we are labeling
broader behavioral dimensions. At this level, we are dealing with constructs,
whether we like to admit it or not, and the importance of establishing the
validity of these constructs becomes crucial. In most cases, these broad
behavior categories have been made up of a collection of more discrete
behavior categories and, in general, the investigators involved have simply
divided behaviors into appropriate-inappropriate or deviant-nondeviant on a
purely a priori basis. While the categorizations often make a good deal of
sense (i.e., have face validity), this hardly seems a completely satisfactory
procedure for the development of a science of behavior.

We have had to face this problem in our own research, where we have
sought to combine the observed rates of certain coded behaviors and come up
with scores reflecting certain behavioral dimensions. The most central
dimension in this research has been the "total deviant behavior score" to
which we have repeatedly referred in this chapter. Let us outline here the
procedures we have used to explore the validity of this score. Although
we had a pretty good idea of which child behaviors would be viewed as
"deviant" or "bad" in this culture, we attempted to enhance the consensual
face validity of this score by asking parents of the "normal" children we
observed to rate the relative deviancy of each of the codes we use in our
research. Thus, in our sample of 33 families of four- and five-year-old
children, we asked each parent to read a simplified version of our coding
manual and characterize each behavior on a three-point scale from "clearly
deviant" to "clearly nondeviant and pleasing." We established an arbitrary
cut-off score and characterized any behavior above this cut-off as deviant.
This resulted in a list of 15 deviant behaviors out of a total of 35 codes.
The second step in validating this score and our implicit deviant-nondeviant
dimension was presented in a study by Adkins and Johnson (1972). We had
already divided our 35 codes into positive, negative, and neutral
consequences. This categorization was done on a purely a priori basis with a
little help from the data provided by Patterson and Cobb (1971) on the
function of some of these codes for eliciting and maintaining children's
behavior. We reasoned that behaviors which parents viewed as more deviant
would receive relatively more negative consequences than would behaviors
viewed as less deviant. To test this hypothesis, we simply rank ordered each
behavior, first by the mean parental verbal report score obtained and second
by the mean proportion of negative consequences the behavior received from
family members. The results of this procedure are presented in Table 1. Not
all 35 behaviors are included in this analysis, but the complex reasons for
this outcome can more parsimoniously be explained in footnote 5. In any case,
the Spearman Rank Order Correlation between the two methods of characterizing
behaviors on the deviant-nondeviant dimension was .73. This was an encouraging
finding, but we noticed that the most dramatic exceptions to a more perfect
agreement between the two methods involved the reasonable command codes
(command positive and command negative). These codes are used when the child
reasonably asks someone to do something (positive command) or not to do
something (negative command). Naturally, most parents felt that these
innocuous responses were nondeviant. But, behaviorally, people don't always do
what they are asked to by a four- or five-year-old child, and since
noncompliance was coded as a negative consequence, it seemed that this
artifact of our characterization might have artificially lowered this
coefficient. By eliminating these two command categories from the calculation,
the correlation coefficient was raised to .81.

Insert Table 1 about here

The third piece of evidence for the validity of the deviant behavior
score comes from the Johnson and Lobitz (1972) study already reviewed in
the previous section. In this study, parents were asked to make their
children look "good" and "nondeviant" for half of the observations and "bad"
or "deviant" on the other half. They were not told how to accomplish this, nor
were they told what behaviors were considered "bad" or "deviant." The fact
that the deviant behavior score was consistently and significantly higher
on the "bad" days lends further evidence for the construct validity of the
score.

While evidence for the convergent or predictive validity of behavioral
data is difficult to find in the literature, there are some encouraging
exceptions to this general lack of data. Patterson and Reid (1971), for
example, found an average correlation of .63 (p < .05) between parents'
observations of their children's low rate referral symptoms on a given day and
the trained observer's tally of total deviant behaviors on that day.



Several studies have found a significant relationship between behavioral
ratings of children in the classroom and academic achievement (Meyers, Attwell,
& Orpet, 1968; D'Heurle, Mellinger, & Haggard, 1959; Hughes, 1968). The
data base of these studies is somewhat different from that currently employed
by most behaviorists because they involve ratings by observers on relatively
broad dimensions, as opposed to behavior rate counts. For example, dimensions
used in these studies included "coping strength," defined as ability to attend
to reading tests while being subjected to delayed auditory feedback (Hughes,
1968), or "persistence," defined as ". . . uses time constructively and to
good purpose; stays with work until finished" (D'Heurle, Mellinger, & Haggard,
1959). Nevertheless, these studies demonstrate the potential for behavior
observation data to provide evidence of predictive validity. Two other
studies (Cobb, 1969; Lahaderne, 1968) yield similar predictive validity
findings based on behavioral rate data. Lahaderne (1968) found that attending
behavior, as observed over a two-month period, provided correlations ranging
from .39 to .51 with various standard tests of achievement. Even with
intelligence level controlled, significant correlations between attentive
behavior and achievement were found. Cobb (1969) obtained similar results in
correlating various behavior rate scores with arithmetic achievement, but
found no significant relationship between these behavior scores and
achievement in spelling and reading. These predictive validity studies are
very important to the development of the field as they suggest that
manipulation of these behavioral variables may well result in productive
changes in academic achievement.

In our own laboratory, we are exploring the convergent validity of
naturalistic behavioral data by relating it to measures on similar dimensions
in the laboratory which include: a) parent and child interaction behavior in
standard stimulus situations similar to those employed by Wahler (1967) and
Johnson and Brown (1969), b) parent behavior in response to standard
stimulus audio tapes similar in design to those used by Rothbart and Maccoby
(1966) and parent behavior in standardized tasks similar to those used by
Berberich (1970), and c) parent attitude and behavior rating measures on
their children. Unfortunately, at this writing, most of this data has not
been completely analyzed, but an overall report of this research will be
forthcoming. A recent dissertation by Martin (1971), however, was devoted to
studying the relationships between parent behavior in the home and parent
behavior in analogue situations. By and large, the results of this research
indicated no systematic relationships between the two measures. The same
general findings for parents' responses to deviant and nondeviant behavior
were replicated in the naturalistic and the analogue data, but correlations
relating individual parental behavior in one setting with that in the other
were generally nonsignificant. We don't know, of course, which, if either,
of the measures represents "truth," but this study underlines the importance
of seriously questioning the assumptions usually made in any analogue or
modified naturalistic research. As Martin (1971) points out, these negative
results are very representative of findings in other investigations where
naturalistic behavior data has been compared to data collected in more
artificial analogue conditions (e.g., see Fawl, 1963; Gump & Kounin, 1960;
Chapanis, 1967).

Before closing this section on validity, we would like to briefly
take note of the efforts of Cronbach and his associates to reconceptualize
the issue of observer agreement, reliability and validity as parts of the
broader concept of generalizability. A full elaboration of generalizability
theory goes far beyond the purposes of this chapter and the interested
reader may be referred to several primary and secondary sources for a more
complete presentation of this model (e.g., Cronbach, Rajaratnam, & Gleser,
1963; Rajaratnam, Cronbach, & Gleser, 1965; Gleser, Cronbach, & Rajaratnam,
1965; Wiggins, 1972). According to this generalizability view, the concerns
of observer agreement, reliability and validity all boil down to a concern
for the extent to which an obtained score is generalizable to the "universe"
to which the researcher wishes the score to apply. Once an investigator
is able to specify this "universe," he should be able to specify and test
the relevant sources of possible threat to generalizability. In a typical
naturalistic observational study, for example, we would usually at least
want to know the generalizability of data across a) observers, b) occasions
in the same setting, and c) settings. Through the generalizability model,
each of these sources of variance could be explored in a factorial design and
their contribution analyzed within an analysis-of-variance model. This model
is particularly appealing because it provides for simultaneous assessment of
the extent of various sources of "error" which could limit generalizability.
In spite of the advantages of this factorial model, there are few precedents
for its use. This is probably more the result of practical problems rather
than a resistance to this intellectually appealing and theoretically sound
model. Even if one were to restrict himself to the three sources of variance
outlined above, the resulting generalizability study would, for most useful
purposes, be a formidable project, indeed. Projects of this kind appear to
us, however, to be well worth doing and we can probably expect to see more
investigations which employ this generalizability model.
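
To give a feel for the analysis-of-variance logic, the following is a minimal sketch restricted to a single source of variance, observers, crossed with subjects and with one score per cell. It is a deliberate simplification of the three-facet design described above, not a reproduction of any published analysis. The function names and the scores are hypothetical, and setting negative variance estimates to zero is a common convention that we assume here.

```python
def variance_components(scores):
    """Estimate variance components for a subjects x observers table.

    scores[s][o] is the score observer o assigned to subject s, with one
    score per cell (a two-way crossed design without replication).
    """
    n_s, n_o = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_s * n_o)
    subj_means = [sum(row) / n_o for row in scores]
    obs_means = [sum(row[o] for row in scores) / n_s for o in range(n_o)]

    ss_subj = n_o * sum((m - grand) ** 2 for m in subj_means)
    ss_obs = n_s * sum((m - grand) ** 2 for m in obs_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_res = ss_total - ss_subj - ss_obs

    ms_subj = ss_subj / (n_s - 1)
    ms_obs = ss_obs / (n_o - 1)
    ms_res = ss_res / ((n_s - 1) * (n_o - 1))

    # Negative estimates are set to zero (assumed convention).
    return {
        "subjects": max(0.0, (ms_subj - ms_res) / n_o),
        "observers": max(0.0, (ms_obs - ms_res) / n_s),
        "residual": max(0.0, ms_res),
    }


def generalizability_coefficient(components, n_observers):
    """Coefficient for relative decisions based on a mean over n_observers."""
    error = components["residual"] / n_observers
    return components["subjects"] / (components["subjects"] + error)


# Hypothetical deviant behavior scores: four children, two observers each.
example = [[12.0, 14.0], [20.0, 19.0], [7.0, 9.0], [15.0, 16.0]]
parts = variance_components(example)
print(parts, generalizability_coefficient(parts, n_observers=2))
```

Adding occasions and settings as further crossed facets follows the same pattern but multiplies the number of mean squares and interaction terms, which is one way of seeing why the full study is a formidable project.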






It should be pointed out at this point that the generalizability study
outlined above does not really speak to the traditional validity requirement
as succinctly defined by Campbell and Fiske (1959): "Validity is represented
in the agreement between two attempts to measure the same trait through
maximally different methods." As stated earlier, to fulfill this
requirement, one must provide evidence of some form of convergent validity by
the use of methods other than direct behavioral observation. The
generalizability model can, theoretically, handle any factor of this type
under the heading of methods or "conditions," but the analysis-of-variance
model employed requires a factorial design. Thus, it would seem extremely
difficult and sometimes impossible to integrate factorially other methods of
testing or rating in a design which encompassed the three variables outlined
above: observers, occasions and settings. As a result of these considerations,
we question the extent to which one generalizability study, at least in this
area of research, can fulfill all the requirements of observer agreement,
validity, and reliability which we view as so important. Rather, it is likely
that multiple analyses will still be necessary to sufficiently establish
all of the methodological requirements we have outlined for naturalistic
observational data. These multiple analyses may, of course, involve analyses
of variance in a generalizability model or correlational analyses as
traditionally employed.

Krantz (1971) points out that the basic controversy over group vs.
individual subject designs has contributed largely to the development of the
mutual isolation of operant and nonoperant psychology. Since the measurement
of reliability and convergent validity is typically based on correlations
across a group of subjects, the operant psychologist may feel that these are
alien concepts which have no relevance for his research. We would dispute
this view on the following logical grounds. Reliability involves the
requirement for consistency in measurement and without some minimal level of
such consistency, there can be no demonstration of functional relationships
between the dependent variable and the independent variable. Efforts are
currently underway to discover statistical procedures for establishing
reliability estimates for the single case (e.g., see Jones, 1972). Any operant
study which involves repeating manipulative procedures on more than one subject
can be used for reliability assessment by traditional methods. Once such
reliability is established, either for the individual case or for a group, we
can be much more confident in the data and its meaning. Validity involves
the requirement of convergence among different methods in measuring the same
behavioral dimension. Where the validity of a measurement procedure has
been previously established for a group, we can use it with more confidence
in each individual case. Where it has not, it is still possible to explore
for convergence in a single case. We can simply see, for example, if the
child who shows high rates of aggressive behavior is perceived as aggressive
by significant others. This procedure may be done with some precision if
normative data is available on the measures used in the single case. Thus,
with normative data available one can explore the position of the single case
on the distribution of each measurement instrument. One could see, for
example, if the child who is perceived to be among the top 5% in
aggressiveness actually shows aggressive behavior at a rate higher than 95% of
his peers. The requirements of reliability and validity are logically sound
ones which transcend experimental method and means of calculation.
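
The single-case check against normative data can be sketched as follows. The normative samples, the 95th-percentile cutoff applied to both instruments, and the function names `percentile_rank` and `measures_converge` are all illustrative assumptions rather than part of any established procedure.

```python
def percentile_rank(score, norms):
    """Percentage of normative scores that fall strictly below `score`."""
    below = sum(1 for value in norms if value < score)
    return 100.0 * below / len(norms)


def measures_converge(rating, rating_norms, rate, rate_norms, cutoff=95.0):
    """True when the single case is at or above `cutoff` on both instruments."""
    return (percentile_rank(rating, rating_norms) >= cutoff
            and percentile_rank(rate, rate_norms) >= cutoff)


# Hypothetical norms: perceived-aggressiveness ratings and observed
# aggressive acts per hour for twenty peers.
rating_norms = list(range(1, 21))
rate_norms = [0.5 * k for k in range(1, 21)]

# A child rated 20 and observed at 10.0 acts per hour.
print(percentile_rank(20, rating_norms), percentile_rank(10.0, rate_norms))
print(measures_converge(20, rating_norms, 10.0, rate_norms))
```

A child who is extreme on the rating but unremarkable on the observed rate (or the reverse) would fail this check, which is exactly the kind of divergence the single-case exploration is meant to reveal.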

These methodological issues, like all others presented in this chapter,
are highly relevant for behavioral research, even though they may at first
seem alien to it as the products of rival schools of thought. It has been our
argument that the requirements of sound methodology transcend "schools" and
that the time has come for us to attend to any variables which threaten the
quality, generalizability, or meaningfulness of our data. Behavioral
data is the most central commonality and critical contribution of all
behavior modification research. The behaviorists' contribution to the science
of human behavior and to solutions of human problems will largely rest on
the quality of this data base.

Footnotes

1. The preparation of this manuscript and the research reported therein
was supported by research grant MH 19633-01 from the National Institute of
Mental Health. The writers would like to thank their many colleagues who
contributed critical reviews of this manuscript: Robyn Dawes, Lewis Goldberg,
Richard Jones, Gerald Patterson, John Reid, Carl Skindrud and Geoffrey White.

2. The authors would like to credit Lee Sechrest for first suggesting
this illustrative example.

3. The authors would like to credit Donald Hartman for clarifying
this as the appropriate procedure for establishing the level of agreement to
be expected by chance.

4. For additional justification of the use of this statistical
procedure for problems of this kind, see Wiggins (1972).

5. Several behaviors which are used in the coding system are not
included in the present analysis. The behaviors humiliate and dependency
could not be included because they did not occur in the behavioral sample.
Repeated noncompliance and temper tantrums were not used on the verbal
report scale because they are subsumed in other categories (i.e., tantrums
are defined as simultaneous occurrences of three or more of the following:
physical negative, destructiveness, crying, yelling, etc.). Nonresponding
of the child was excluded post hoc because it was clear that parents were
responding to this item as ignoring rather than mere nonresponse to ongoing
activity (i.e., it was a poorly-written item).

Table 1

Coded Behaviors as Ranked by Two Methods:
Parental Ratings and Negative Social Consequences

Behavior                              Rank by                      Proportion
Rank by                               Proportion    Mean Parent    of Negative
Parental                              of Negative   Rating for     Consequences
Rating       Behavior                 Consequences  Behavior       to Behavior

1            Whine                    13            [illegible]    [illegible]
2            Physical Negative        2             1.074          .527
4            Destructive              8             [illegible]    [illegible]
[illegible]  Tease                    5             [illegible]    [illegible]
[illegible]  Smart Talk               [illegible]   [illegible]    [illegible]
6            Aversive Command         3             [illegible]    [illegible]
7            Noncompliance            12            [illegible]    [illegible]
8            High Rate                16            1.307          [illegible]
9            Ignore                   11            1.370          [illegible]
10           Yell                     10            [illegible]    [illegible]
11           Demand Attention         15            1.611          .083
12           Negativism               6             [illegible]    .375
13           Command Negative         1             [illegible]    [illegible]
14           Disapproval              9             1.870          .235
15           Cry                      14            [illegible]    [illegible]
16           Indulgence               22            [illegible]    .027
17           Command Prime            27.5          2.132          [illegible]
18           Receive                  18            [illegible]    [illegible]
19           Talk                     23            2.278          [illegible]
20           Command                  7             2.296          .355
21           Attention                25            [illegible]    .013
22           Touch                    20            [illegible]    .043
23           Independent Activity     26            2.704          .005
24           Physical Positive        21            [illegible]    [illegible]
25           Comply                   17            2.753          .053
26           Laugh                    19            2.778          .044
27           Nonverbal Interaction    24            2.833          .012
28           Approval                 27.5          2.926          .000

Spearman rank-order correlation between columns 1 & 2 = .73 (p < .01).